“Originally, there was just experimental science, and
then there was theoretical science, with Kepler’s
Laws, Newton’s Laws of Motion, Maxwell’s equations,
and so on. Then, for many problems, the theoretical
models grew too complicated to solve analytically,
and people had to start simulating. These simulations
have carried us through much of the last half of the
last century. At this point, these simulations are generating
a whole lot of data, along with a huge increase
in data from the experimental sciences.”


Jim Gray, 2007

In view of the paradigm shift that makes science ever more data-driven, in this
thesis we propose a synthesis method for encoding and managing large-scale
deterministic scientific hypotheses as uncertain and probabilistic data.

In the form of mathematical equations, hypotheses symmetrically relate aspects
of the studied phenomena. For computing predictions, however, deterministic
hypotheses can be abstracted as functions. We build upon Simon’s notion of structural
equations in order to efficiently extract the (so-called) causal ordering between
variables, implicit in a hypothesis structure (set of mathematical equations).

We show how to process the hypothesis predictive structure effectively through
original algorithms for encoding it into a set of functional dependencies (fd’s) and
then performing causal reasoning in terms of acyclic pseudo-transitive reasoning
over fd’s. Such reasoning reveals important causal dependencies implicit in the
hypothesis predictive data and guides our synthesis of a probabilistic database. As in
the field of graphical models in AI, such a probabilistic database should be normalized
so that the uncertainty arising from competing hypotheses is decomposed into
factors and propagated properly onto predictive data by recovering its joint probability
distribution through a lossless join. That is motivated as a design-theoretic
principle for data-driven hypothesis management and predictive analytics.

The method is applicable to both quantitative and qualitative deterministic 
hypotheses and demonstrated in realistic use cases from computational science. 



Abstract of the thesis presented to LNCC/MCT as partial fulfillment of the
requirements for the degree of Doctor of Sciences (D.Sc.)

MANAGING LARGE-SCALE SCIENTIFIC HYPOTHESES

Advisor: Fabio Porto, D.Sc.

In view of the paradigm shift that makes science ever more data-driven, in this
thesis we propose a method for encoding and managing large-scale deterministic
scientific hypotheses as uncertain and probabilistic data.

In the form of mathematical equations, hypotheses symmetrically relate aspects
of the studied phenomenon. For computing predictions, however, deterministic
hypotheses can be abstracted as functions. We build upon Simon’s notion of
structural equations in order to efficiently extract the so-called causal ordering
implicit in the structure of a hypothesis.

We show how to process the predictive structure of a hypothesis through
original algorithms for encoding it as a set of functional dependencies (fd’s),
and then perform causal inference in terms of acyclic pseudo-transitive reasoning
over fd’s. Such reasoning reveals important causal dependencies implicit in the
hypothesis predictive data, which guide our synthesis of the probabilistic database.
As in the field of graphical models (AI), the probabilistic database should be
normalized so that the uncertainty arising from alternative hypotheses is decomposed
into factors and propagated properly by recovering its joint probability
distribution through a lossless join. That is motivated as a design-theoretic
principle for hypothesis management and analytics.

The proposed method is applicable to both quantitative and qualitative
deterministic hypotheses and is demonstrated in realistic cases from computational
science.

Chapter 1


Introduction 


In view of the paradigm shift that makes science ever more data-driven [1], in this
thesis we demonstrate that large deterministic scientific hypotheses can be effectively
encoded and managed as a kind of uncertain and probabilistic data.

Deterministic hypotheses can be formed as principles or ideas, then expressed
mathematically and implemented in a program that is run to give their decisive
form of data (see Fig. 1.1). Hypotheses can also be learned at large scale, as exhibited
in the Eureqa project [2]. Examples of ‘structured deterministic hypotheses’
include tentative mathematical models in physics, engineering and economic sciences,
or conjectured boolean networks in molecular biology and social sciences.
These are important reasoning devices, as they are solved to generate valuable
predictive data for decision making in science and increasingly in business as well.

In fact, we can refer nowadays to a broad, modern context of data science [3]
and big data [4] in which the complexity and scale of so-called ‘data-driven’ problems
require proper data management tools for the predicted data to be analyzed
effectively. In this thesis, we pay attention to a quite general class of (tentative)
computational science models,^1 and we look at them in an original way as a
distinguished kind of data source.


^1 ‘Computational science’ is (sic.) “a rapidly growing multidisciplinary field that uses
advanced computing capabilities to understand and solve complex problems” [5]. We may refer
to non-stochastic, tentative computational science models throughout this text as ‘structured
deterministic hypotheses.’

(i)   “If a body falls from rest, its velocity at any point
      is proportional to the time it has been falling.”

(ii)  a(t) = -g
      v(t) = -g t + v_0
      s(t) = -(g/2) t^2 + v_0 t + s_0

(iii) for k = 0:n
          t = k * dt;
          v = -g*t + v_0;
          s = -(g/2)*t^2 + v_0*t + s_0;
          % arrays are 1-based, so step k is stored at index k+1
          t_plot(k+1) = t;
          v_plot(k+1) = v;
          s_plot(k+1) = s;
      end

(iv)  FALL
      t      v       s
      0      0    5000
      1    -32    4984
      2    -64    4936
      3    -96    4856
      4   -128    4744


Figure 1.1. Multi-fold view of a deterministic scientific hypothesis. 
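
As a minimal sketch of the step from panel (iii) to panel (iv), the following Python fragment recomputes the FALL rows. The values g = 32, v_0 = 0 and s_0 = 5000 are assumptions inferred from the tabulated rows (they are not stated in the text), and the function name `fall` is hypothetical.

```python
# Assumed values, inferred from the FALL rows in Fig. 1.1 (not stated in the text).
G = 32       # gravitational acceleration
V0 = 0       # initial velocity
S0 = 5000    # initial position


def fall(n, dt=1):
    """Return the rows (t, v, s) of relation FALL for steps k = 0..n."""
    rows = []
    for k in range(n + 1):
        t = k * dt
        v = -G * t + V0                       # v(t) = -g t + v_0
        s = -(G / 2) * t**2 + V0 * t + S0     # s(t) = -(g/2) t^2 + v_0 t + s_0
        rows.append((t, v, s))
    return rows


FALL = fall(4)
for row in FALL:
    print(row)
```

Each tuple corresponds to one row of the relation, so the same hypothesis can be viewed as a law, as equations, as a program, or as data.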


It is generally considered that computational science models, interpreted here
as hypotheses to explain real-world phenomena, are of strategic relevance [5]. They
are usually complex in that they may have hundreds to thousands of intertwined
(coupled) variables and be computed along space, time or frequency domains at
arbitrarily large scale. It is important to note the distinction between the structure
and data levels. Consider, say, the Lotka-Volterra model, which essentially consists
(Eqs. 1.1) of two ordinary differential equations, complemented by seven subsidiary
equations f_1(t), f_2(x_0), f_3(y_0), f_4(b), f_5(p), f_6(r), f_7(d) to set the values of
its domain variable t and (input) parameters x_0, y_0, b, p, r, d.

    \dot{x} = x(b - p y)
    \dot{y} = y(r x - d)                                                    (1.1)

In a sense, it can be said to be fairly simple, as it is characterized by a set E of equations
and a set V of variables, sized |E| = |V| = 9. Yet, at the data level this model (cf.
Chapter 1^ can be made very large just by computing its predictions in a fine time 
resolution and/or along an extended time window. 
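
The contrast between the two levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parameter values are made up, forward Euler is used for simplicity (it is not claimed to be the solver used in this thesis), and the function name `lotka_volterra` is hypothetical. The structure (Eqs. 1.1) stays fixed while the number of predicted rows grows with the time resolution.

```python
def lotka_volterra(x0, y0, b, p, r, d, t_end, dt):
    """Forward-Euler integration of Eqs. 1.1; returns rows (t, x, y)."""
    n = round(t_end / dt)            # number of integration steps
    x, y = x0, y0
    rows = [(0.0, x, y)]
    for k in range(1, n + 1):
        dx = x * (b - p * y)         # x' = x(b - p y)
        dy = y * (r * x - d)         # y' = y(r x - d)
        x, y = x + dt * dx, y + dt * dy
        rows.append((k * dt, x, y))
    return rows


# Illustrative (assumed) parameter values.
params = dict(x0=30.0, y0=4.0, b=0.4, p=0.018, r=0.023, d=0.8)

coarse = lotka_volterra(**params, t_end=10, dt=0.1)
fine = lotka_volterra(**params, t_end=10, dt=0.001)
print(len(coarse), len(fine))
```

The same nine-variable structure yields a hundred times more rows in the second run, which is the sense in which the data level can be made arbitrarily large.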

As we shall see shortly, the technical challenges associated with this thesis
lie mostly (though not only) at the structure level where, e.g., such a Lotka-Volterra
model can be abstracted as a deterministic structure 5(£^,V) with |5| = ISj^ 

^2 The structure length |S| is a measure of how dense the hypothesis structure is, comprising
the total sum of the number of variables appearing in each equation.














We are really concerned here with models whose structure S is in the order of
|S| < 1M, and whose results (data!) would be difficult to analyze by handcrafted
practice. Note that the data level of a model can be set as large as wanted (by setting
the domain resolution and/or extension accordingly), but it is necessarily
large when its structure is itself large. By ‘large-scale hypotheses’ we then mean
tentative deterministic models that are large at the structure level.

Overall, such a class of hypotheses can be said to qualify for at least four of
the five v’s associated with the notion of big data:^3 value, because of their role
in advancing science and technology; volume, due to the large scale of modern
scientific problems; variety, because of their structural heterogeneity, even when
they refer to the same phenomena; and veracity, due to their uncertainty.

The idea of managing hypotheses ‘as data’ may sound intriguing and in fact
it raises a number of research questions of both conceptual and technical nature.^4

We start by outlining below the conceptual research questions.

RQ1. How to define and encode hypotheses ‘as data’? What are the sources of
uncertainty that may be present and should be considered?

RQ2. How do hypotheses ‘as data’ relate to observational data or, likewise,
phenomena ‘as data’ from a database perspective?

RQ3. Does every piece of simulated data qualify as a scientific hypothesis? What
is the difference between managing ‘simulation’ data and managing
‘hypotheses’ as data?

RQ4. Is there a proper (machine-readable) data format from which we can
automatically extract mathematically expressed hypotheses?


It has been a challenge of this thesis to provide reasonable answers to these
questions, which are brought together into the vision of hypotheses ‘as data’ (we call
it the Υ-DB vision) and its use case that we present in Chapter 2 and experiment
with in realistic scenarios in Chapter 6.


^3 The ‘v’ of velocity may appear in connection with machine learning hypotheses, which we
discuss in a later chapter.


^4 We shall keep a record of those questions and revisit them in §7.1.

The Υ-DB vision formulates the problem of hypothesis encoding as a problem
of probabilistic database design. A number of technical questions then arise.

We now introduce the technical context, materials and methods identified and
selected in this thesis as a basis to realize the Υ-DB vision in terms of probabilistic
database design. We shall then outline the technical research questions to
be answered by the core of the thesis.


1.1. Problem Space and Specific Goals 

It has been a goal of this thesis to investigate the capabilities of probabilistic
databases to enable hypothesis data management as a particular case of simulation
data management. In what follows, we first characterize the use case of hypothesis
data management and then formulate it in terms of probabilistic DB design.


1.1.1 Simulation data management 

Simulation laboratories provide scientists and engineers with very large, possibly
huge datasets that reconstruct phenomena of interest in high resolution. Notable
examples are the Johns Hopkins Turbulence Databases [6] and the Human
Brain Project (HBP) neuroscience simulation datasets [7]. A core motivation for
the delivery of such data is enabling new insights and discoveries through hypothesis
testing against observations. Nonetheless, while the use case for exploratory
analytics is currently well understood and many of its challenges have already been
addressed, so that high-resolution simulation data is increasingly accessible
[8, 9], only very recently, as part of this thesis work, has the use case of hypothesis
management been taken into account for predictive analytics [10].

In fact, there is a pressing call for innovative technology to integrate (observed)
data and (simulated) theories in a unified framework [11, 12, 13]. The
point has recently been raised by leading neuroscientists in the context of the HBP,
who are incisive on the compelling argument that massive simulation databases
should be constrained by experimental data in corrective loops to test precise
hypotheses [14, p. 28]. Fig. 1.2 shows a simplified view of the (data-driven) scientific
method life cycle. It distinguishes the phases of exploratory analytics (context of
discovery) and predictive analytics (context of justification), and highlights the
loop between hypothesis formulation and testing [15].

Figure 1.2. A view of the scientific method life cycle. It highlights hypothesis
formulation and a backward transition to reformulation if predictions ‘disagree’
with observations.

Simulation data, being generated and tuned from a combination of theoretical
and empirical principles, has a distinctive feature to be considered when
compared to data generated by high-throughput technology in large-scale scientific
experiments. It has a pronounced uncertainty component that motivates the
use case of hypothesis data management for predictive analytics [10]. Essential
aspects of hypothesis data management can be described in contrast to simulation
data management as follows; Table 1.1 summarizes our comparison.


Table 1.1. Simulation data management vs. hypothesis data management.

    Simulation data management               Hypothesis data management
    -------------------------------------    ----------------------------------
    Exploratory analytics                    Predictive analytics
    Raw data                                 Sample data
    Extremely large (TB, PB)                 Very large (MB, GB)
    Dimension-centered access pattern        Claim-centered access pattern
    Denormalized for faster retrieval        Normalized for uncertainty factors
    Batch-, incremental-only data updates    Probability distribution updates

• Sample data. Hypothesis management does not deal with the same volume
of data as simulation data management for exploratory analytics, but
only with samples of it. This is aligned, for example, with the architectural
design of CERN’s particle-physics experiment and simulation ATLAS, where
there are four tiers (layers) of data. The volume of data significantly
decreases from the raw data (tier-0) to the data actually used for
analyses such as hypothesis testing (tier-3) [8, p. 71-2]. Samples of raw
simulation data are to be selected for comparative studies involving competing
hypotheses in the presence of evidence (sample observational data). This
principle is also aligned with how data is delivered at model repositories.
Since observations are usually less available, only the fragment (sample)
of the simulation data that matches in coordinates the (sample of)
observations is required out of simulation results for comparative analysis. For
instance, we show in §6.2.2 a predictive analytical study extracted from the
Virtual Physiological Rat Project (VPR1001-M) comparing sample simulation
data (heart rates) from a baroreflex model with observations on a
Dahl SS rat strain.^5 The simulation is originally set to produce predictions
at a time resolution of 0.01. But since the observational sample is
only as fine as 0.1, there is no gain in rendering a predicted sample
at a resolution finer than 0.1 for hypothesis testing. Note that such ‘sampling’
does not incur any additional uncertainty of the kind typical of statistical
sampling [16].


• Claim-centered access pattern. In simulation data management the access
pattern is dimension-centered (e.g., based on selected space-time coordinates)
and the data is denormalized for faster retrieval, as typical of Data
Warehouses (DW’s) and OLAP applications.^6 In particular, on account
of the so-called ‘big table’ approach, each state of the modeled physical
system is recorded in a large, single row of data. This is fairly reasonable
for an Extract-Transform-Load (ETL) data ingestion pipeline characterized
by batch-, incremental-only updates (see Fig. 1.3). Such a setting is
in fact fit for exploratory analytics, as entire states of the simulated system
are accessed at once (e.g., providing data to a visualization system).
Altogether, data retrieval is critical and there is no risk of update anomalies.
Hypothesis management, in contrast, should be centered on claims
identified within the hypothesis structure w.r.t. available data dependencies.
Since the focus is on resolving uncertainty for decision making (which
hypothesis is a best fit?), the data must be normalized based on uncertainty
factors. This is key for the correctness of uncertainty modeling and
the efficiency of probabilistic reasoning, say, in a probabilistic database
[17, p. 30-1].

• Uncertainty modeling. In uncertain and probabilistic data management
[17], the uncertainty may come from two sources: incompleteness (missing
data), and multiplicity (inconsistent data). Hypothesis management
on sample simulation data is concerned with the multiplicity of prediction
records due to competing hypotheses targeted at the same studied
phenomenon. Such a multiplicity naturally gives rise to a probability
distribution that may be initially uniform and eventually conditioned on
observations. Conditioning is an applied Bayesian inference problem that
translates into a database update for transforming the prior probability
distribution into a posterior [10].

Figure 1.3. The usual data ingestion (ETL) pipeline of simulation data management.
Datasets generated by simulation trials on (hypothesis) models are
each loaded into a ‘big’ table. The uncertainty is then “buried” in the
database, which lacks a logical organization for enabling data-driven hypothesis
management and predictive analytics.

^5 http://virtualrat.org/computational-models/vpr1001/

^6 On-Line Analytical Processing, as distinguished from OLTP (On-Line Transaction Processing).
The latter is meant for transaction processing of daily queries and updates in operational
systems, while the former is for analytical queries in Data Warehouses (DW’s) that gather a lot
of data collected from different sources for decision making.

Overall, hypothesis data management is also OLAP-like, yet markedly different
from simulation data management.
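
The ‘sample data’ principle can be sketched as a selection of the simulated rows whose time coordinates match the observed ones. The resolutions 0.01 and 0.1 are those quoted in the text; everything else (the `predict` stand-in, the time window, the variable names) is hypothetical. Exact fractions are used here only to avoid floating-point mismatches when comparing coordinates.

```python
from fractions import Fraction

SIM_DT = Fraction(1, 100)   # simulation time resolution (0.01)
OBS_DT = Fraction(1, 10)    # observation time resolution (0.1)


def predict(t):
    """Hypothetical stand-in for a simulated quantity (e.g., a heart rate)."""
    return 300 + 5 * float(t)


# Raw simulation data: one row per simulated time point in [0, 1].
simulation = [(k * SIM_DT, predict(k * SIM_DT)) for k in range(101)]

# Time coordinates at which observations exist: 0, 0.1, ..., 1.0.
observed_times = {k * OBS_DT for k in range(11)}

# Keep only the predictions whose coordinates match an observation.
sample = [(t, value) for (t, value) in simulation if t in observed_times]
print(len(simulation), len(sample))
```

No prediction is altered or estimated in this step; rows are only filtered by coordinate, which is why this kind of sampling adds no statistical uncertainty.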


A key point that distinguishes hypothesis management is that a fact or unit of
data is defined by its predictive content. That is, every clear-cut predicted
fact (w.r.t. available data dependencies) is a claim. Accordingly, the data should
be decomposed and organized for a claim-centered access pattern.

Figure 1.4. Pipeline for processing hypotheses as uncertain and probabilistic
data. For each hypothesis k, its structure S_k is given in a machine-readable
format, and all of its sample simulation data trials are assigned their
target phenomenon, say φ, to be loaded into a ‘big table’ H_k. Then the synthesis
comes into play to read a base of possibly very many hypotheses ⋃_k H_k and
transform them into a probabilistic database where each hypothesis is decomposed
into claim tables. A probability distribution is computed for each
phenomenon φ, covering all the hypotheses and their trials targeted at φ. This
distribution is then updated into a posterior in the presence of observational data.


To anticipate a later chapter, the synthesis method we have developed in this
thesis work for processing hypotheses as uncertain and probabilistic data comprises
a design-theoretic pipeline (see Fig. 1.4) that extends the one shown in Fig. 1.3.


1.1.2 Probabilistic database design 

Probabilistic databases (p-DB’s) have evolved into mature technology in the
last decade with the emergence of new data models and query processing techniques
[17]. One of the state-of-the-art probabilistic data models is the U-relational
representation system with its probabilistic world-set algebra (p-WSA) implemented in
MayBMS [18]. That is an elegant extension of the relational model that we shall refer to
in this thesis for the management of large-scale uncertain and probabilistic data.

We look at U-relations from the point of view of p-DB design, for which no
formal design methodology has yet been proposed. Despite the advanced state of
probabilistic data management techniques, a lack of methods for the systematic
design of p-DB’s may prevent wider adoption. The availability of design methods
has been considered one of the key success factors for the rapid growth of applications
in the field of Graphical Models (GM’s) [19], considered to inform research
in p-DB’s [17, p. 14]. Analogously, we have proposed to distinguish methods for
p-DB design in three classes [10]: (i) subjective construction, (ii) learning from
data, and (iii) synthesis from another kind of formal specification.

The first is the least systematic, as the user has to model the data and
correlations by steering the whole p-DB construction process (MayBMS’ use cases [18],
e.g., are illustrated that way). The second comprises analytical techniques to extract
the data and learn correlations from external sources, possibly unstructured,
into a p-DB under some ad-hoc schema. This is the prevalent one to date,
motivated by information extraction and data integration applications [17, p. 10-3].
In this thesis we present a methodology of the third kind, as we extract data
dependencies from some previously existing formal specification (the hypothesis
mathematical structure) to synthesize a p-DB algorithmically. Such a type of
construction method has been successful, e.g., for building Bayesian Networks [19]. To
our knowledge, this thesis presents the first synthesis method for p-DB design (cf. §5.6).

We shall develop means to extract the specification of a hypothesis and encode 
it into a U-relational DB for data-driven hypothesis management and analytics. 
That is, we shall flatten deterministic hypotheses into U-relations. 

The synthesis method that we have developed for p-DB’s relies on the extraction
of functional dependencies (fd’s; cf. [20, 21, 22]) that are basic input to
algorithmic synthesis.^7 For an example of an fd, consider relation FALL in Fig. 1.1.
There holds an fd t → v s, meaning that values of attribute time t functionally
determine values of both attributes velocity v and position s. More precisely, let
μ and τ be any two tuples (rows) in an instance of relation (table) FALL. Then it
satisfies fd t → v s iff μ[t] = τ[t] implies μ[v s] = τ[v s]. In our illustrative relation
FALL, that fd is, in particular, a key constraint, which means that (values of) t
play the role of a key to (provide access to the values of) v and s in the relation.
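
The satisfaction condition just stated can be checked directly on a relation instance. The sketch below is a plain, quadratic-time check written for illustration only; the function name `satisfies_fd` and the dictionary encoding of tuples are assumptions, and the rows are those of relation FALL in Fig. 1.1.

```python
from itertools import combinations


def satisfies_fd(rows, lhs, rhs):
    """True iff any two tuples that agree on `lhs` also agree on `rhs`."""
    for mu, tau in combinations(rows, 2):
        agree_lhs = all(mu[a] == tau[a] for a in lhs)
        agree_rhs = all(mu[a] == tau[a] for a in rhs)
        if agree_lhs and not agree_rhs:
            return False
    return True


FALL = [
    {"t": 0, "v": 0, "s": 5000},
    {"t": 1, "v": -32, "s": 4984},
    {"t": 2, "v": -64, "s": 4936},
    {"t": 3, "v": -96, "s": 4856},
    {"t": 4, "v": -128, "s": 4744},
]

holds = satisfies_fd(FALL, ["t"], ["v", "s"])

# A second, conflicting prediction for t = 1 violates the fd t -> v s.
conflicting = FALL + [{"t": 1, "v": -30, "s": 4985}]
violated = not satisfies_fd(conflicting, ["t"], ["v", "s"])
print(holds, violated)
```

The violated case is the situation that arises when competing hypotheses predict different values for the same coordinates, which is where uncertainty enters.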

A related concept which is also a major one for us is that of normalization
[20, 21, 22], which ensures that the DB resulting from a design process bears some
desirable properties associated with some notion of normal form (ibid.).
For hypothesis management, the uncertainty has to be modeled and should be
normalized so that the uncertainty of one claim is not undesirably mixed
with the uncertainty of another claim. This is expected to involve processing the
causal dependencies implicit in the given hypothesis structure. We shall introduce
such concepts in detail, in context, when necessary.

^7 In fact, the lack of techniques to obtain important information such as fd’s in the real
world has been considered a critical failure in traditional DB design [23, p. 62].

1.1.3 Structural equations 

The flattening of the user’s mathematical models into hypothesis p-DB’s, nonetheless,
is not straightforward. It has been a goal of this thesis to investigate proper
abstractions on mathematical models in order to (partly) capture their semantics,
viz., to an extent that is tailored for hypothesis management (as opposed to, say,
model solving). We shall abstract mathematical models into intermediary artifacts
that are amenable to be further encoded into fd’s.

In fact, given a system of equations with a set of variables appearing in
them, in a seminal article Simon introduced an asymmetrical, functional relation
among variables that establishes a (so-called) causal ordering [24]. That became
known as structural equation models (SEM’s) or just ‘structural equations’ (cf. also
[25]). Along these lines, our goal is to extract the causal ordering implicit in the
structure of a deterministic hypothesis into a set of fd’s that guides our synthesis
of U-relational DB’s. As we shall see throughout this text,
the causal ordering we capture and process through fd’s provides causal
dependencies implicit in the predictive data that are very useful information to
decompose uncertainty for the sake of probabilistic modeling and reasoning.
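
To make the idea of orienting equations concrete, here is a deliberately simplified sketch. It is not the algorithm developed in this thesis: it only handles structures that can be solved one equation at a time (no blocks of simultaneous equations), and both the function name `causal_fds` and the structure given for the falling-body hypothesis are hypothetical. Each equation is represented just by the set of variables appearing in it.

```python
def causal_fds(structure):
    """Orient each equation toward the one variable it determines.

    `structure` maps an equation name to the set of variables appearing in it.
    Returns a dict: determined variable -> frozenset of variables it depends on.
    Simplification: raises ValueError if some equations cannot be solved one
    at a time (e.g., they must be solved simultaneously).
    """
    determined = {}
    pending = dict(structure)
    while pending:
        progress = False
        for eq, variables in list(pending.items()):
            unknown = variables - set(determined)
            if len(unknown) == 1:
                (x,) = unknown
                determined[x] = frozenset(variables - {x})
                del pending[eq]
                progress = True
        if not progress:
            raise ValueError("structure cannot be ordered one equation at a time")
    return determined


# Assumed structure for the falling-body hypothesis of Fig. 1.1.
fall = {
    "f1": {"t"},
    "f2": {"g"},
    "f3": {"v0"},
    "f4": {"s0"},
    "f5": {"g", "t", "v0", "v"},
    "f6": {"g", "t", "v0", "s0", "s"},
}

fds = causal_fds(fall)
for var, causes in fds.items():
    print(" ".join(sorted(causes)) or "(exogenous)", "->", var)
```

Each entry reads as an fd whose left-hand side causally determines the right-hand side; variables with an empty left-hand side are set exogenously.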


1.1.4 Uncertainty Model 

In uncertain and probabilistic data management, there are essentially two sources 
of uncertainty: incompleteness (missing data), and multiplicity (inconsistent data). 


The kind of uncertainty that is dealt with in this work is the multiplicity of
hypothesis trial records identified to be targeted at the same phenomenon record.
That is, the uncertainty arises from the existence of competing hypotheses.
If multiple hypotheses and trials are inserted for the same phenomenon, the
system interprets it as defining a probability distribution.

Such a probability distribution (usually uniform) on the multiplicity of competing
hypotheses is in accordance with probability theory under possible-worlds
semantics [17, Ch. 1]. It is modeled in the U-relational data model and its
p-WSA operators, and implemented in the MayBMS system,^8 as we shall see later.
The conf() aggregate operator, for instance, in spite of the name, performs
standard (non-Bayesian) probabilistic inference on such a probability distribution.
Eventually, however, there is a need to condition the initial probability distribution
in the presence of observations. For the conditioning, then, we shall adopt
Bayesian inference so that the prior probability distribution can be updated to a
posterior.
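
The prior-to-posterior update can be illustrated outside any database system. The sketch below assumes a Gaussian error model with a user-chosen standard deviation, which is an assumption made here for illustration and not something this section specifies; the function names, the two competing hypotheses and the observed values are likewise hypothetical.

```python
import math


def uniform_prior(hypotheses):
    """Equal probability for each competing hypothesis."""
    return {h: 1.0 / len(hypotheses) for h in hypotheses}


def gaussian_likelihood(predicted, observed, sigma):
    """Likelihood of the observations under an assumed Gaussian error model."""
    return math.prod(
        math.exp(-((o - p) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
        for p, o in zip(predicted, observed)
    )


def condition(prior, predictions, observed, sigma=1.0):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    weights = {
        h: prior[h] * gaussian_likelihood(predictions[h], observed, sigma)
        for h in prior
    }
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Two competing (hypothetical) velocity predictions at t = 0, 1, 2.
predictions = {"h1": [0, -32, -64], "h2": [0, -30, -60]}
observed = [0.5, -31.6, -63.2]

prior = uniform_prior(predictions)
posterior = condition(prior, predictions, observed, sigma=2.0)
print(prior, posterior)
```

In a probabilistic database setting, the posterior would replace the prior probabilities attached to the competing hypotheses, which is the sense in which conditioning translates into a database update.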

The informal discussion of this section opens the way for a number of technical
research questions that we outline next.

RQ5. Is there an algorithm to, given a SEM, efficiently extract its causal ordering?
What are the computational properties of this problem?

RQ6. What is the connection between SEM’s and fd’s? Can we devise an encoding
scheme to ‘orient equations’ and then effectively transform one into
the other with guarantees? Once we do it, what design-theoretic properties
does such a set of fd’s have?

RQ7. Is such an fd set ready to be used for p-DB schema synthesis as an encoding
of the hypothesis causal structure? If not, what kind of further processing
do we have to do? Can we perform it efficiently by reasoning directly on the
fd’s? How does it relate to the SEM’s causal ordering?

RQ8. Is the uncertainty decomposition required for predictive analytics reducible
to the structure level (fd processing), or do we need to process the simulated
data to identify additional uncertainty factors? Finally, what properties
are desirable for a p-DB schema targeted at hypothesis management? Are
they ensured by this synthesis method?

RQ9. Given all such design-theoretic machinery to process hypotheses into
(U-)relational DB’s, what properties can we detect on the hypotheses back
at the conceptual level? Do we now have technical means to speak of
hypotheses that are “good” in terms of principles of the philosophy of
science?

The core of this thesis is devoted to answering these questions, and we shall accomplish
it throughout the chapters that follow.

^8 Our own system of hypothesis management is to be delivered on top of the MayBMS backend.


1.2. Thesis Statement 

The statement of this thesis is that it is possible to effectively encode and
manage large deterministic scientific hypotheses as uncertain and probabilistic
data. Its key challenges are of both conceptual and technical nature. Conceptually,
we provide core, non-obvious abstractions to define and encode hypotheses
as data. Technically, we provide a number of algorithms that compose a
design-theoretic pipeline to encode hypotheses as uncertain and probabilistic data, and
verify their efficiency and correctness. The applicability and effectiveness of our
method are demonstrated in realistic case studies in computational science.

Besides, it is worthwhile highlighting some non-goals of this thesis.


N1. Although we perform some sort of information extraction [26] for the
acquisition of hypotheses from some model repositories on the web, it is very
basic and ad-hoc, meant only to obtain a testbed for our method. That is,
we are not proposing means for the systematic extraction of hypotheses
from available sources. In fact we shall outline it in §7.3 as an important
direction of future work.


N2. We do not address solving computational models or numerical analytics 
in any sense. In fact we rely on the numerical solvers (implemented into 
tools that we use) as ‘transaction processing’ systems, load their computed 
data into a relational ‘big’ fact table and then render it into U-relational 
tables synthesized by our method. We do not deal with data visualization 
either in any sense. 

N3. The efficiency and scalability of query processing in p-DB’s, in particular in
the U-relational system MayBMS and its p-WSA (which we rely on), are not addressed or
evaluated in this thesis. In fact, the performance of U-relations and p-WSA
has been extensively evaluated and shown to be effective [27, 18]. All performance
tests carried out in this thesis concern our design-theoretic techniques
for the encoding and synthesis of U-relational hypothesis databases.

N4. In terms of uncertainty and statistical analysis, we restrict ourselves to (i)
processing some well-defined forms of multiplicity in the data, which constitute the
model of uncertainty dealt with in this work; then (ii) performing probabilistic
inference by relying on MayBMS; and (iii) eventually (at application level)
performing Bayesian inference so that a posterior probability distribution
is propagated through p-DB updates. We do not provide any additional
form of uncertainty management. Rather, we manage the data extracted
into the system (under user control) and process its uncertainty in terms
of the specific sources of uncertainty recognized in Υ-DB (cf. Chapter 2).

1.3. Thesis Contributions 

The contributions of this thesis are outlined as follows. 

1.3.1 Innovative Contributions 

This thesis presents the vision of hypotheses as data (and its use case), the
so-called Υ-DB vision. It has been published in the vision track of VLDB 2014 [10]
for its (sic.) potentially high-impact visionary content. The innovative system of
Υ-DB has been described in a ‘system prototype demonstration’ paper [28].^9

1.3.2 Technical Contributions 

This thesis presents specific technical developments over the Υ-DB vision. In
short, it shows how to encode deterministic hypotheses as uncertain and probabilistic
data. Our detailed technical contributions (cf. Chapters 3, 4 and 5)
are formulated into a formal method for the design of hypothesis p-DB’s, which is
described in a technical report [29].^10 The method, together with our realistic
testbed scenarios and performance evaluation, is yet to be published.

^ Preliminary version available at CoRR abs/1411.7419
^ Preliminary version available at CoRR abs/1411.5196

1.4. Thesis Outline

The structure of the remainder of this thesis is outlined for reference. 

Chapter [T-DB Vision]. The research vision of hypotheses as (uncertain and
probabilistic) data, the characterization of its use case, key points and
technical challenges are presented.

Chapter [Encoding]. The problem of encoding a hypothesis ‘as data’ given
its formal specification (set of mathematical equations) is presented and
addressed by an encoding scheme that transforms the equations into fd’s with
guarantees in terms of preserving the hypothesis causal structure.

Chapter [Causal Reasoning]. A technique is presented for causal reasoning
as acyclic pseudo-transitive reasoning over the encoded fd’s. It processes the
hypothesis causal ordering to find the ‘first causes’ for each of its
predictive variables.

Chapter [p-DB Synthesis]. A technique is presented to address the problem
of uncertainty introduction and propagation for the transformation of
hypotheses into U-relational databases. The synthesized U-database is shown to
bear desirable properties for hypothesis management and predictive analytics.

Chapter [Applicability]. A discussion of applicability is presented, together
with the implementation of the proposed techniques into a prototype system for
test and demonstration of the vision realization through realistic case
studies.

Chapter [Conclusions]. Research questions are revisited, and the significance
and limitations of the thesis are discussed, with directions to future work
and final considerations.



Chapter 2 


Vision: Hypotheses as Data 


High-throughput technology and large-scale scientific experiments provide
scientists with empirical data that has to be extracted, transformed and loaded
before it is ready for analysis [1]. In this vision we consider theoretical
data, or data generated by simulation from deterministic scientific hypotheses,
which also needs to be pre-processed to be analyzed.

Hypotheses as data. In view of the age of data-driven science, we consider
deterministic scientific hypotheses from a multi-fold point of view: formed as
principles or learned in large scale,^ hypotheses are formulated
mathematically and coded in a program that is run to give their decisive form
of data (see Fig. 1.1).


Uncertain data. The semantic structure of relation FALL (Fig. 1.1, item (iv))
can be expressed by the functional dependency (fd) t → v s. This is the typical
semantics assigned to empirical data in the design of experiment databases. A
space-time dimension (like time t in our example) is used as a key to
observables (like velocity v and position s). In empirical uncertainty, it is
such “physical” dimension keys like t that may be violated, say, by
alternative sensor readings.

Hypotheses, however, are tentative explanations of phenomena [15], which
characterizes a different kind of uncertain data. In order to manage such
theoretical uncertainty, we shall need two special attributes to compose, say,
the epistemological dimension of keys to observables: φ, identifying the
studied phenomena; and υ, identifying the hypotheses aimed at explaining them.
That is, we shall leverage the semantics of relations like FALL to
φ υ t → v s. This leap is a core abstraction in this vision of T-DB.

^ As exhibited, e.g., in the Eureqa project [2].

Figure 2.1. Deterministic scientific hypotheses seen as alternative functions to
predict data, giving rise to both theoretical and empirical sources of uncertainty.

Predictive data. Scientific hypotheses are tested by way of their predictions
[15]. In the form of mathematical equations, hypotheses symmetrically relate
aspects of the studied phenomenon. However, for computing predictions,
deterministic hypotheses are applied asymmetrically as functions [30]. They
take a given valuation over input variables (parameters) to produce values of
output variables (predictions). By observing that, we shall seek a principled
method to transform the (symmetric) mathematical equations of a hypothesis
into (asymmetric) fd’s.

By looking at deterministic hypotheses as alternative functions to predict
data (see Fig. 2.1), in this vision we shall deal with two sources of
uncertainty. Given a well-defined context with a set of alternative hypotheses
aimed at explaining (providing predictions for) a selected phenomenon:

• Theoretical uncertainty^ concerns which tentative model (function) is the
  best one to produce (the best) data.

• Empirical uncertainty^ concerns, for each candidate model, which (parameter)
  input setting calibrates it best for the selected phenomenon.

Note that these two sources of uncertainty are intertwined in that one cannot
‘clean’ one without cleaning the other: neither theory nor parameters are
directly observable, but only their joint results (the predictions) [15]. In
this thesis we aim at providing means to support this kind of ‘integrated’
analytics.

^ That is, multiplicity of hypothesis entries associated with a phenomenon.

^ That is, multiplicity of hypothesis trial entries associated with a phenomenon.

Applications. Big computational science research programs such as the
Human Brain Project^ or Cardiovascular Mathematics^ are highly demanding
applications challenged by such theoretical (big) data. Users need to analyze
results of hundreds to thousands of data-intensive simulation trials.

Besides, recent initiatives on web-based model repositories have been
fostering large-scale model integration, sharing and reproducibility in the
computational sciences (e.g., inn [32l EH]). They are growing reasonably fast
on the web, (i) promoting some MathML-based standard for model specification,
but (ii) with limited integrity and lack of support for rating/ranking
competing models. For those two reasons, they provide a strong use case for our
vision of hypothesis management. The Physiome project [HH] El], e.g., is
planned to integrate several large deterministic models of human physiology; a
fairly simple model of the human cardiovascular system, e.g., has about 600+
variables.

Also, there is a pressing call for deep predictive analytic tools to support
users assessing what-if scenarios in business enterprises [35]. Deep predictive
analytics are based on first principles (deterministic hypotheses) and go
beyond descriptive analytics or shallow predictive analytics such as
statistical forecasting (ibid.).^

U-relations. All of this confirms that hypothesis management is a promising
class of applications for probabilistic DB’s. The vision of T-DB is currently
set to be delivered on top of U-relations and probabilistic world-set algebra
(p-WSA) [18]. These were developed in the influential MayBMS project.^ As
implied by some of its design principles, viz., compositionality and the
ability to introduce uncertainty, MayBMS’ query language fits hypothesis
management well. We shall look at it, as previously mentioned, from the point
of view of a synthesis method for p-DB design. We shall particularly make use
of the repair key operation, which gives rise to alternative worlds as
maximal-subset repairs of an argument key.

^ http://www.humanbrainproject.eu/
^ http://icerm.brown.edu/tw14-1-pdecm
^ The concept of ‘deep’ predictive analytics is from Haas et al. [35], and is
discussed in more detail in §2.6.1.
^ Project website: http://maybms.sourceforge.net/. MayBMS is a backend
extension of PostgreSQL. It offers all the traditional querying capabilities of
the latter in addition to the uncertain and probabilistic ones.

Figure 2.2. Predictive analytics in a data-intensive hypothesis evaluation study:
hypotheses (simulated data) compete to explain a phenomenon (observed data).

Predictive analytics tool. In database research (e.g., [36]), uncertainty
is usually seen as an undesirable property that hinders data quality. We shall
refer to U-relations and p-WSA as implemented in MayBMS, nonetheless, to show
that the ability to introduce controlled uncertainty into an (otherwise
complete) simulation dataset can be a tool for ‘deep’ predictive analytics on
a set of competing or alternative hypotheses. Fig. 2.2 shows such a scenario,
in which hypotheses ‘as data’ compete to explain a phenomenon ‘as data.’

As a roadmap to most of the remainder of this chapter, we claim that if
hypotheses can be encoded and identified (see §2.2), and their uncertainty
quantified by some probability distribution (see §2.4), then they can be
rated/ranked and browsed by the user under selectivity criteria. Furthermore,
their probabilities can be conditioned for possibly being re-ranked in the
presence of evidence (see §2.5).


2.1. Running Example 

Let us consider Example 1 for the presentation of the vision.

Example 1 Research is conducted on the effects of gravity on a falling object
in the Earth’s atmosphere. Scientists are uncertain about the precise object’s
density and its predominant state as a fluid or a solid. Three hypotheses are
then considered as alternative explanations of the fall (see Fig. 2.3). Due to
parameter uncertainty, six simulation trials are run for H1, and four for H2
and H3 each. □


PHENOMENON
φ    Description
1    Effects of gravity on an object falling in the Earth’s atmosphere.

HYPOTHESIS
υ    Name
1    Law of free fall
2    Stokes’ law
3    Velocity-squared law

Figure 2.3. Descriptive (textual) data of Example 1.

Figure 2.4. ‘Big’ fact table H1 of hypothesis υ = 1 loaded with simulation raw
data: trials on H1 are identified by tid.

The construction of T-DB, a Data Warehouse (DW), requires a simple user
description of a research study. That is, descriptive records of the phenomena
and hypotheses dimensions (see Fig. 2.3) are to be inserted first such that
basic referential constraints are satisfied by their associated datasets (fact
tables). For instance, each one of the six trial datasets for hypothesis H1
shall reference its id υ = 1 as a foreign key from table HYPOTHESIS further in
their synthesized relations.

Fig. 2.4 shows the ‘big’ fact table H1 for hypothesis υ = 1 loaded with its
trial datasets for phenomenon φ = 1. Although table H1 is denormalized for
faster data retrieval, as usual in DW’s, the extraction of the hypothesis
equations allows it to be rendered automatically, since all variables must
appear in some equation. Now we proceed to the hypothesis encoding and start to
address research questions RQ1-4.

2.2. Hypothesis Encoding

We aim at extracting, for each hypothesis, a set of fd’s from its mathematical
equations. Suppose we are given the set of equations of hypothesis H1 below,
and let us examine the set Σ1 of fd’s we target at.^

H1. Law of free fall

\[
\begin{aligned}
a(t) &= g \\
v(t) &= -g\,t + v_0 \\
s(t) &= -(g/2)\,t^2 + v_0\,t + s_0
\end{aligned}
\]

\[
\Sigma_1 = \{\; \phi \to g,\;\; \phi \to v_0,\;\; \phi \to s_0,\;\;
g\,\upsilon \to a,\;\; g\,v_0\,t\,\upsilon \to v,\;\;
g\,v_0\,s_0\,t\,\upsilon \to s \;\}.
\]

In order to derive Σ1 from the equations of H1, we focus on their implicit data
dependencies and get rid of constants and possibly complex mathematical
constructs. Equation v(t) = −g t + v0, e.g., written this way (roughly
speaking), suggests that v is a prediction variable functionally dependent on t
(the physical dimension), g and v0 (the parameters). Yet a dependency like
g v0 t → v may hold for infinitely many equations.^ In fact, we need a way to
identify H1’s mathematical formulation precisely, i.e., an abstraction of its
data-level semantics. This is achieved by introducing hypothesis id υ as a
special attribute in the fd (see Σ1).

This is a data representation of a deterministic scientific hypothesis. It is
built into an encoding scheme (see §3.4) that leverages the semantics of
structural equations.

The other special attribute, the phenomenon id φ, is supposed to be a key to
the values of parameters, i.e., determination of parameters is an empirical,
phenomenon-dependent task. The fd φ → g v0 s0 is to be (expectedly) violated
when the user is uncertain about the values of parameters. The same rationale
applies to derive Σ2 = Σ3 from the equations of H2, H3 below. These, n.b., vary
in structure w.r.t. H1 (e.g., they include parameter D, the object’s diameter).

^ Recall that a rigorous presentation of the method to encode fd set Σ1 is due
by Chapter [Encoding].
^ Think of, say, how many polynomials satisfy that dependency ‘signature.’


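H2. Stokes’ law

\[
\begin{aligned}
a(t) &= 0 \\
v(t) &= -\sqrt{g D / 4.6 \times 10^{-4}} \\
s(t) &= -t\,\sqrt{g D / 4.6 \times 10^{-4}} + s_0
\end{aligned}
\]

H3. Velocity-squared law

\[
\begin{aligned}
a(t) &= 0 \\
v(t) &= -g D^2 / 3.29 \times 10^{-6} \\
s(t) &= -(g D^2 / 3.29 \times 10^{-6})\,t + s_0
\end{aligned}
\]

\[
\Sigma_2 = \Sigma_3 = \{\; \phi \to g,\;\; \phi \to D,\;\; \phi \to s_0,\;\;
\phi \to a,\;\; g\,D\,\upsilon \to v,\;\; g\,D\,s_0\,t\,\upsilon \to s \;\}.
\]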

The key point here is that, if the hypothesis structure (set of equations) is
given in a machine-readable format for mathematics, then the method to extract
the hypothesis fd set from its equations can be carefully designed based on
such hypothesis data representation abstraction. In fact, we shall explore
W3C’s MathML as a format for hypothesis specification.
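The intuition behind the target fd set can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the encoding scheme of §3.4: the equations are described by hand as a mapping from each predicted variable to the variables it reads (a hypothetical input format standing in for MathML), and the attribute names `phi` and `upsilon` are placeholders for φ and υ.

```python
# Illustrative sketch only: a simplified stand-in for the encoding scheme,
# which in the thesis works on MathML input and structural equations.
PHI = "phi"          # phenomenon id (assumed attribute name)
UPSILON = "upsilon"  # hypothesis id (assumed attribute name)


def encode_fds(equations, parameters):
    """Return a set of fd's as (lhs, rhs) pairs, lhs being a frozenset.

    equations:  dict mapping each predicted variable to the set of
                variables its equation reads (parameters and dimensions).
    parameters: set of input variables whose values are tied to the
                phenomenon.
    """
    fds = set()
    # Parameter values are keyed by the phenomenon id: phi -> p.
    for p in parameters:
        fds.add((frozenset({PHI}), p))
    # Each prediction depends on its inputs plus the hypothesis id.
    for output, inputs in equations.items():
        fds.add((frozenset(inputs) | {UPSILON}, output))
    return fds


# Law of free fall (H1), described by hand.
h1_equations = {
    "a": {"g"},
    "v": {"g", "v0", "t"},
    "s": {"g", "v0", "s0", "t"},
}
sigma1 = encode_fds(h1_equations, parameters={"g", "v0", "s0"})

for lhs, rhs in sorted(sigma1, key=lambda fd: (len(fd[0]), fd[1])):
    print(" ".join(sorted(lhs)), "->", rhs)
```

For H1 the sketch yields the six fd’s listed in Σ1. It deliberately ignores constant equations such as a(t) = 0, whose treatment belongs to the rigorous scheme.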

2.3. Reasoning over FD’s 

Once each hypothesis fd set has been extracted, some reasoning is to be
performed to discover implicit data dependencies. In fact, dependency theory is
equipped with a formal system (cf. §4.1) for reasoning over fd sets like Σ1 and
deriving other fd’s in its closure Σ1⁺. As we elaborate on in Chapter [Causal
Reasoning], we shall be particularly concerned with the pseudo-transitivity
inference rule. Applied over fd’s {φ → g, g υ → a} ⊂ Σ1, for instance, it gives
us φ υ → a. This inference allows us to observe that g is a ‘factor’ on the
uncertainty of a, but {φ υ} should be a dimensional key constraint for values
of a.
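The pseudo-transitivity rule (from X → A and A Z → W, infer X Z → W) is easy to state in code. The sketch below is a naive illustration, not the algorithm developed later in the thesis; the function names and the attribute names `phi` and `upsilon` are assumptions made for the example, and the loop in `first_causes` presumes an acyclic fd set.

```python
def pseudo_transitivity(fd1, fd2):
    """From X -> A and (A Z) -> W, infer (X Z) -> W.

    Each fd is a pair (lhs, rhs) with lhs a frozenset of attributes and
    rhs a single attribute. Returns None when the rule does not apply.
    """
    x, a = fd1
    lhs2, w = fd2
    if a not in lhs2:
        return None
    return (frozenset(x) | (frozenset(lhs2) - {a}), w)


def first_causes(fds, target):
    """Repeatedly rewrite the lhs of `target` until no attribute in it is
    itself determined by another fd. Assumes the fd set is acyclic."""
    lhs, rhs = target
    changed = True
    while changed:
        changed = False
        for fd in fds:
            if fd[1] in lhs:
                lhs, rhs = pseudo_transitivity(fd, (lhs, rhs))
                changed = True
                break
    return (lhs, rhs)


phi_g = (frozenset({"phi"}), "g")
phi_v0 = (frozenset({"phi"}), "v0")
phi_s0 = (frozenset({"phi"}), "s0")
g_a = (frozenset({"g", "upsilon"}), "a")
g_s = (frozenset({"g", "v0", "s0", "t", "upsilon"}), "s")

derived_a = pseudo_transitivity(phi_g, g_a)
derived_s = first_causes([phi_g, phi_v0, phi_s0], g_s)
print(sorted(derived_a[0]), "->", derived_a[1])
print(sorted(derived_s[0]), "->", derived_s[1])
```

The first call reproduces the inference φ υ → a from the text; the second applies the same rule three times to rewrite the fd for s in terms of φ, t and υ only.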

In fact, note that derived fd’s like (φ υ → a) ∈ Σ1⁺, which should be a
constraint on values of a in H1, are (expectedly) violated in the presence of
uncertainty: observe in Fig. 2.4 the multiplicity {32.0, 32.2} of values of a
under the same pair (φ ↦ 1, υ ↦ 1), which should functionally determine them in
H1. For that reason we admit a special attribute ‘trial id’ tid to be imposed
onto H1 for a trivial repair, provisionally, until uncertainty can be
introduced in a controlled way by synthesis ‘4U.’ It is meant to identify
simulation trials and “pretend” certainty so as not to lose the integrity of
the data. It is under this imposed certainty that the raw simulation trial data
is safely loaded from files (see Fig. 2.4). Note, however, how ‘certainty’ is
held at the expense of redundancy and, most importantly, opaqueness for
predictive analytics (since tid isolates or hides the inconsistency w.r.t. the
violated constraints). This is until the next stage of the T-DB construction
pipeline, when uncertainty is to be introduced in a controlled manner.
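A small check makes the role of tid concrete. The sketch below groups rows by a candidate key and reports the groups that carry more than one value of the dependent attribute. The rows are a hand-made projection of H1 (the time dimension is omitted and the layout is an assumption); only the values 32.0 and 32.2 of a come from the text.

```python
from collections import defaultdict


def violations(rows, lhs, rhs):
    """Return {lhs-values: set of rhs-values} for groups breaking lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].add(row[rhs])
    return {k: v for k, v in groups.items() if len(v) > 1}


# Assumed projection of fact table H1 on (tid, phi, upsilon, a).
h1 = [
    {"tid": 1, "phi": 1, "upsilon": 1, "a": 32.0},
    {"tid": 2, "phi": 1, "upsilon": 1, "a": 32.0},
    {"tid": 3, "phi": 1, "upsilon": 1, "a": 32.0},
    {"tid": 4, "phi": 1, "upsilon": 1, "a": 32.2},
    {"tid": 5, "phi": 1, "upsilon": 1, "a": 32.2},
    {"tid": 6, "phi": 1, "upsilon": 1, "a": 32.2},
]

broken = violations(h1, ("phi", "upsilon"), "a")   # phi upsilon -> a
repaired = violations(h1, ("tid",), "a")           # tid -> a
print(broken)
print(repaired)
```

The first call exposes the multiplicity under (φ ↦ 1, υ ↦ 1); the second returns no violation, which is exactly the ‘pretended’ certainty that tid provides.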


2.4. Uncertainty Introduction 

Before we proceed to the uncertainty introduction procedure, note in relation
H1 (Fig. 2.4) that the predicted acceleration values a are such that an
association between the hypothesis and a target phenomenon, viz.,
(φ ↦ 1, υ ↦ 1), is established. In fact, as of the insertion of each hypothesis
trial dataset, the user must set a target phenomenon for it. This may be
non-obvious but is quite a convenient design decision for the envisioned system
of T-DB, because hypotheses, as (abstract) universal statements [15], can only
yield predictions (be empirically grounded) by being assigned to (calibrated
onto) some real-world phenomenon. This assignment is set at data entry time
because in fact it only holds at the data level.^ It is to be recorded in an
‘explanation’ table named H0 by default (see Fig. 2.5, top), being provided
with weights for establishing a prior probability distribution which (by user
choice) may or may not be uniform.

The data transformation of ‘certain’ to ‘uncertain’ relations then starts with
query Q0, whose result set is materialized into U-relational table Y0 (see
Fig. 2.5). As we introduce in detail in §5.1, U-relations have in their schema
a set of pairs (Vi, Di) of condition columns [18] to map each discrete random
variable xi created by the repair-key operation to one of its possible values
(e.g., x0 ↦ 1). The world table W is internal to MayBMS and automatically
stores their marginal probabilities. The formal semantics of the repair-key
operation is given in §5.1.

^ Hypotheses are ‘universal’ by definition m-. They (must) qualify for a class
of different situated phenomena, while their predictive datasets must be very
specific (for one specific situation).

Figure 2.5. ‘Explanation’ relational table H0 and its associated U-relational
table Y0 rendered by application of the repair-key operation.

Q0. create table Y0 as select φ, υ from (repair key φ in H0 weight by Conf);
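To convey what Q0 produces, the following Python sketch mimics the effect of repair key on a small table: rows sharing a key value become mutually exclusive alternatives of one random variable, with probabilities proportional to their weights. It is an informal model, not MayBMS code; the function name, the column names and the weights in `h0` are assumptions chosen for illustration (the contents of Fig. 2.5 are not reproduced here).

```python
from collections import defaultdict


def repair_key(rows, key, weight):
    """Group rows by `key` and normalize `weight` within each group.

    Returns {key_value: [(row, probability), ...]}. Each group plays the
    role of one discrete random variable whose alternatives are mutually
    exclusive.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = {}
    for key_value, group in groups.items():
        total = sum(r[weight] for r in group)
        result[key_value] = [(r, r[weight] / total) for r in group]
    return result


# Hypothetical explanation table H0: three hypotheses for phenomenon 1.
h0 = [
    {"phi": 1, "upsilon": 1, "conf": 2},
    {"phi": 1, "upsilon": 2, "conf": 1},
    {"phi": 1, "upsilon": 3, "conf": 1},
]
y0 = repair_key(h0, key="phi", weight="conf")

for row, prob in y0[1]:
    print("x0 ->", row["upsilon"], "with probability", prob)
```

With equal weights the prior would be uniform; unequal weights, as here, encode a user-supplied prior over the competing hypotheses.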


The possible-world semantics of p-DB’s (cf. §5.1) can be seen as a
generalization of data cleaning. In the context of p-DB’s ini, data cleaning
does not have to be one-shot, which is more error-prone [37]. Rather, it can be
carried out gradually, viz., by keeping all mutually inconsistent tuples under
a probability distribution (ibid.) that can be updated in face of evidence
until the probabilities of some tuples eventually tend to zero, at which point
they are eliminated. This motivates Remark 1.

Remark 1 Consider U-relational table Y0 (Fig. 2.5). Note that it abstracts the
goal of a data-intensive hypothesis evaluation study, or the scientific method
itself [15], as the repair of each φ as a key. That is, in T-DB users can
develop their research directly upon data with support of query and update
capabilities to rate/rank their hypotheses υ w.r.t. each φ, until the
relationship r(φ, υ) is repaired to be a function f: Φ → Υ from each phenomenon
φ to its best explanation υ. □


Given a ‘big’ fact table such as H1, we need to identify/group the correlated
input attributes under independent uncertainty units, viz., ‘u-factors,’ each
one associated with a random variable.^ We illustrate that by means of query
Q1, which materializes view Y1[g] for (let g = Zi) identified u-factor Zi ⊆ Z
in H1[φ, Z].

^ An attribute can be inferred ‘input’ (viz., a parameter) by means of fd
reasoning (cf. §3.4).

H1[tid, φ, Z]
tid   φ   g      v0   s0
1     1   32     0    5000
2     1   32     10   5000
3     1   32     20   5000
4     1   32.2   0    5000
5     1   32.2   10   5000
6     1   32.2   20   5000

Y1[g]
V ↦ D     φ   g
x1 ↦ 1    1   32
x1 ↦ 2    1   32.2

W
V ↦ D     Pr
x1 ↦ 1    .5
x1 ↦ 2    .5

Figure 2.6. Result set of query Q1 on simulation trial dataset for hypothesis H1.


Q1. create table Y1[g] as select U.φ, U.g from (repair key φ in
      (select φ, g, count(*) as Fr from H1 group by φ, g)
    weight by Fr) as U;


The result set of Q1 is stored in Y1[g] (see Fig. 2.6). Note that the possible
values of g are mapped to random variable x1, and that table H1 is considered
the source for a joint probability distribution (on the values of H1’s input
parameters) which may not be uniform: we count the frequency Fr of each
possible value of a u-factor Zi ⊆ Z (as done for g in Q1) and pass it as
argument to the weight-by construct.
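The frequency counting behind Q1 can be reproduced outside the database. The sketch below uses the six trial rows of H1 shown in Fig. 2.6 and computes, per phenomenon, the relative frequency of each value of a chosen u-factor. The function name `u_factor` and the dictionary layout are assumptions made for the example.

```python
from collections import Counter

# Input part of fact table H1, as shown in Fig. 2.6.
H1 = [
    {"tid": 1, "phi": 1, "g": 32.0, "v0": 0, "s0": 5000},
    {"tid": 2, "phi": 1, "g": 32.0, "v0": 10, "s0": 5000},
    {"tid": 3, "phi": 1, "g": 32.0, "v0": 20, "s0": 5000},
    {"tid": 4, "phi": 1, "g": 32.2, "v0": 0, "s0": 5000},
    {"tid": 5, "phi": 1, "g": 32.2, "v0": 10, "s0": 5000},
    {"tid": 6, "phi": 1, "g": 32.2, "v0": 20, "s0": 5000},
]


def u_factor(rows, attrs):
    """Relative frequency of each value combination of `attrs`, per phi."""
    # Count occurrences of (phi, values...) as the inner query of Q1 does.
    counts = Counter((r["phi"],) + tuple(r[a] for a in attrs) for r in rows)
    totals = Counter()
    for key, count in counts.items():
        totals[key[0]] += count
    # Normalize within each phenomenon, as 'weight by Fr' does.
    return {key: count / totals[key[0]] for key, count in counts.items()}


marginal_g = u_factor(H1, ["g"])
marginal_v0 = u_factor(H1, ["v0"])
marginal_s0 = u_factor(H1, ["s0"])
print(marginal_g)
print(marginal_v0)
print(marginal_s0)
```

The marginals for g match table W in Fig. 2.6 (.5 each); s0 has a single value and therefore carries no uncertainty.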

So far, we have presented informally the procedure of u-factorization. Now we
proceed to u-propagation; both are presented rigorously in Chapter [p-DB
Synthesis]. We consider g υ → a ∈ Σ1 again in order to synthesize predictive
U-relation Y1[a]. Since a is functionally determined by υ and g only, and these
are independent, we propagate their uncertainty onto a into Y1[a] by query Q2.

Q2. create table Y1[a] as select H1.φ, H1.υ, H1.a from H1, Y0, Y1[g] as G
    where H1.φ=Y0.φ and H1.υ=Y0.υ and G.φ=H1.φ and G.g=H1.g;

Query Q2′ (not shown) then selects φ, υ and a from Yi[a] for each i = 1..3. The
result sets of Q2 and Q2′ (resp. Y1[a] and Y[a]) are shown in Fig. 2.7.

Figure 2.7. U-relational predictive tables rendered by query using the fd’s.

Compare relations H1[a] and Y1[a]. By accounting for the correlations captured
in the fd g υ → a, we could propagate onto a the uncertainty coming from the
hypothesis and from the only parameter a is sensitive to, thus precisely
situating tuples of Y1[a] in the space of possible worlds. The same is done for
predictive attributes v and s. In the end, T-DB shall be ready for predictive
analytics, i.e., with all competing predictions as possible alternatives which
are mutually inconsistent.

A key point here is that the whole synthesis process is amenable to algorithm
design. Except for the user ‘research’ description, the T-DB construction is
fully automated based on the hypothesis structure (set of equations) and the
raw hypothesis trial data.


2.5. Predictive Analytics 

Users of Example 1 have to be able, say, to query phenomenon φ = 1 w.r.t.
predicted position s at specific values of time t by considering all hypotheses
υ admitted. That is illustrated by query Q3, which creates integrative table
Y[s]; and by query Q4, which computes the confidence aggregate operation [18]
for all s tuples where t = 3 (Fig. 2.8 shows Q4’s result, apart from column
Posterior).

The confidence on each hypothesis for the specific prediction of Q4 is split
due to parameter uncertainty such that the parts sum back up to its total
confidence. For H2 and H3, e.g., we have {g D s0 t υ → s} ⊂ Σ, where
Σ = Σ2 = Σ3. Since g and D are the parameter uncertainty factors of s (s0 is
certain), with 2 possible values (not shown) each, there are only 2×2 = 4
possible s tuples for H2 and H3 each. Considering all hypotheses υ for the same
phenomenon φ, the confidence values sum up to one in accordance with the laws
of probability.

Q3. create table Y[s] as select U.φ, U.υ, U.t, U.s from
      (select φ, υ, t, s from Y1[s] union all
       select φ, υ, t, s from Y2[s] union all
       select φ, υ, t, s from Y3[s]) as U, Y0
    where U.φ=Y0.φ and U.υ=Y0.υ;

Q4. select φ, υ, s, conf() as Prior from Y[s] where t=3
    group by φ, υ, s order by Prior desc;
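The splitting of a hypothesis’ confidence over its parameter settings can be mimicked with plain arithmetic. In the sketch below each prediction tuple receives the product of the hypothesis prior and the marginals of its independent u-factors. The prior 0.25 and the two values of D are invented for illustration (the text does not show them); only the 2×2 structure comes from the paragraph above.

```python
from itertools import product


def prediction_confidences(hyp_prior, factors):
    """Prior confidence of each combination of u-factor values.

    hyp_prior: prior probability of the hypothesis.
    factors:   list of dicts {value: probability}, assumed independent.
    """
    out = {}
    for combo in product(*(f.items() for f in factors)):
        values = tuple(value for value, _ in combo)
        conf = hyp_prior
        for _, prob in combo:
            conf *= prob
        out[values] = conf
    return out


g_factor = {32.0: 0.5, 32.2: 0.5}
d_factor = {0.04: 0.5, 0.06: 0.5}   # hypothetical diameters
h2_conf = prediction_confidences(0.25, [g_factor, d_factor])

for setting, conf in h2_conf.items():
    print(setting, conf)
print("total:", sum(h2_conf.values()))
```

Four tuples result, one per (g, D) setting, and their confidences add up to the prior of the hypothesis. Summing such totals over all hypotheses of a phenomenon gives one.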
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Figure 2.8. Analytics on predicted position s conditioned on observation. 


Users can make informed decisions in light of such confidence aggregates,
which are to be eventually conditioned in face of evidence (observed data).
Example 2 features this kind of Bayesian conditioning for discrete random
variables mapped to the possible values of predictive attributes (like
position s) whose domains are continuous.


Example 2 Suppose position s = 2250 feet is observed at t = 3 secs, with
standard deviation σ = 20. Then, by applying Bayes’ theorem for normal mean
with a discrete prior [16], Prior is updated to Posterior (see Fig. 2.8). □

The procedure uses normal density function (2.1), with (say) σ = 20, to get
the likelihood f(y | μk) of each alternative prediction of s from Y[s] as mean
μk, given y at observed s = 2250. Then it applies Bayes’ rule (2.2) to get the
posterior p(μk | y) [IS].

\[
f(y \mid \mu_k) = \frac{1}{\sqrt{2\pi\sigma^2}}\;
e^{-\frac{(y-\mu_k)^2}{2\sigma^2}}
\tag{2.1}
\]

\[
p(\mu_k \mid y) = \frac{f(y \mid \mu_k)\, p(\mu_k)}
{\sum_{i=1}^{m} f(y \mid \mu_i)\, p(\mu_i)}
\tag{2.2}
\]
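Equations (2.1) and (2.2) translate directly into a few lines of Python. The sketch below conditions a discrete prior on one observation, with y = 2250 and σ = 20 as in Example 2. The three predicted positions and their priors are made up for the illustration, since the contents of Fig. 2.8 are not reproduced here.

```python
import math


def normal_density(y, mu, sigma):
    """Normal density (2.1) of observation y around predicted mean mu."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(
        2 * math.pi * sigma ** 2
    )


def posterior(priors, means, y, sigma):
    """Bayes' rule (2.2) for a discrete prior over alternative means."""
    weighted = [normal_density(y, mu, sigma) * p for mu, p in zip(means, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


means = [2200.0, 2240.0, 2300.0]   # hypothetical predictions of s at t = 3
priors = [0.5, 0.25, 0.25]         # hypothetical prior confidences
post = posterior(priors, means, y=2250.0, sigma=20.0)

for mu, p0, p1 in zip(means, priors, post):
    print(mu, p0, round(p1, 4))
```

The prediction closest to the observation gains most of the probability mass even though it started with a lower prior, which is the re-ranking effect described in the text.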


In the general case (cf. examples shown in Chapter [Applicability]), we
actually have the phenomenon ‘as data:’ a sample of independent observed values
y1, ..., yn (e.g., Brazil’s population observed by census over the years).
Then, the likelihood f(y1, ..., yn | μk) for each competing trial μk is
computed as a product of the single likelihoods f(yj | μkj), for j = 1..n [IS].
Bayes’ rule is then settled by (2.3) to compute the posterior
p(μk | y1, ..., yn) given prior p(μk).

\[
p(\mu_k \mid y_1, \ldots, y_n) =
\frac{\prod_{j=1}^{n} f(y_j \mid \mu_{kj})\; p(\mu_k)}
{\sum_{i=1}^{m} \prod_{j=1}^{n} f(y_j \mid \mu_{ij})\; p(\mu_i)}
\tag{2.3}
\]
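For a sample of observations, (2.3) multiplies many small densities, so a practical sketch works with logarithms and normalizes at the end. This is an implementation choice of the example, not something prescribed by the text; the data below are invented, and every prior is assumed to be strictly positive.

```python
import math


def log_normal_density(y, mu, sigma):
    """Logarithm of the normal density (2.1)."""
    return -((y - mu) ** 2) / (2 * sigma ** 2) - 0.5 * math.log(
        2 * math.pi * sigma ** 2
    )


def posterior_from_sample(priors, predictions, observations, sigma):
    """Posterior (2.3); predictions[k][j] is mu_kj for observation y_j."""
    log_post = []
    for prior, mus in zip(priors, predictions):
        log_lik = sum(
            log_normal_density(y, mu, sigma) for y, mu in zip(observations, mus)
        )
        log_post.append(math.log(prior) + log_lik)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_post)
    unnormalized = [math.exp(lp - top) for lp in log_post]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


observations = [10.0, 20.0, 30.0]      # hypothetical observed sample
predictions = [
    [10.0, 20.0, 30.0],                # trial 1 matches the sample
    [12.0, 23.0, 35.0],                # trial 2 drifts away from it
]
post_sample = posterior_from_sample([0.5, 0.5], predictions, observations, 2.0)
print(post_sample)
```

Each trial contributes one predicted value per observation, mirroring the indices k and j in (2.3).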


As a result, the prior probability distribution assigned to u-factors via
repair key is to be eventually conditioned on observed data. This is an applied
Bayesian inference problem that translates into a p-DB update problem, so as to
induce the effects of posteriors back onto table W. In a first prototype of the
T-DB system (cf. §6.3), we accomplish it by performing Bayesian inference at
application level and then applying p-WSA’s update (a variant of SQL’s update)
in MayBMS. This solution is good enough to let us complete use case
demonstrations of T-DB.^


2.6. Related Work 

The vision of managing hypotheses as data has some roots in Porto and
Spaccapietra [38], who motivated a conceptual data model to support (the
so-called) in silico science by means of a scientific model management system.
We now discuss the work we understand to be most closely related to our vision
of data-driven hypothesis management and analytics.

^ Cf. Chapter [Applicability] and §6.3 in particular.

2.6.1 Models-and-data 

Haas et al. [35] provide an original long-term perspective on the evolution of
database technology. They characterize the data typically managed by
traditional DB systems as a record about the past, not a conclusion or an
insight or a solution (ibid.). In the context of scientific databases, e.g.,
their position suggests that DB technology has been designed for empirical
data, not the theoretical data generated by simulation from domain-specific
principles or scientific hypotheses.

They recognize current DB technology to have raised the art of scalable
‘descriptive’ analytics to a very high level; but point out, however, that
nowadays (sic) what enterprises really need is ‘prescriptive’ analytics to
identify optimal business, policy, investment, and engineering decisions in
the face of uncertainty. Such analytics, in turn, shall rest on deep
‘predictive’ analytics that go beyond mere statistical forecasting and are
imbued with an understanding of the fundamental mechanisms that govern a
system’s behavior, allowing what-if analyses [35]. In sum, there is a pressing
call for deep predictive analytic tools in business enterprises as much as in
science.

In comparison with the T-DB vision, Haas et al. are proposing a long-term
models-and-data research program to pursue data management technology for deep
predictive analytics. They discuss strategies to extend query engines for model
execution within a (p-)DB. Along these lines, query optimization is understood
as a more general problem with connections to algebraic solvers.

Our framework in turn essentially comprises an abstraction and technique
for the encoding of hypotheses as data. It can be understood (in comparison) as
putting models strictly into a flattened data perspective. For that reason it
has been directly applicable by building upon recent work on p-DBs DU-. In
principle, it can be integrated into, say, the OLAP layer of the
models-and-data project.

2.6.2 Scientific simulation data 

As previously mentioned, science’s ETL is distinguished by its infrequent,
incremental-only updates and by having large raw files as data sources [8].
Challenges for enabling efficient access to high-resolution, raw simulation
data have been documented from both supercomputing [6] and database research
viewpoints [39], and pointed out as key to the use case of exploratory
analytics. The extreme scale of the raw data has motivated non-conventional
approaches for data exploration, viz., ‘immersive’ query processing (move the
program to the data) [6l HD], or ‘in situ’ query processing in the raw files
[11, 12]. Both exploit the spatial structure of the data in their indexing
schemes.

That line of research is motivated by equipping scientist end-users for
immediate interaction with their very large simulation datasets.^ The NoDB
approach, in particular, argues for eliminating the ETL phase (viz., the
loading) in favor of direct access to data ‘in situ’ in the raw data files
[12]. In fact, data exploration is a fundamental use case of data-driven
science.

Nonetheless, being generated from first principles or learned deterministic
hypotheses, simulation data has a pronounced uncertainty component that
motivates another use case, viz., the case of hypothesis management and
predictive analytics [35, 10]. As we have motivated in §1.1, the latter
requires probabilistic DB design for enabling uncertainty decomposition
(factorization).

Hypothesis management shall not deal with the same volume of data as in
simulation data management for exploratory analytics, but with samples of it
(cf. Table 1.1 for a comparison). For instance, in CERN’s particle-physics
experiment ATLAS there are four tiers/layers of data management. The volume of
data significantly decreases from the (tier-0) raw data to the (tier-3) data
actually used for analyses such as hypothesis testing [HI p. 71-2].

Overall, the overhead incurred in loading samples of raw simulation trial
datasets into a p-DB is justified for enabling a principled hypothesis
evaluation and rating/ranking according to the scientific method.


2.6.3 Hypothesis encoding 

Our framework is comparable with Bioinformatics’ initiatives that address
hypothesis encoding into the RDF data model [13]: (i) the Robot Scientist [H]
is a knowledge-base system (KBS) for automated generation and testing of
hypotheses about what genes encode enzymes in the yeast organism; (ii) HyBrow
[15] is a KBS for scientists to test their hypotheses about events of the
galactose metabolism, also of the yeast organism; and (iii) SWAN [16] is a KBS
for scientists to share hypotheses on possible causes of Alzheimer’s disease.

^ Sometimes phrased ‘here are my files, here are my queries, where are my
results?’ [41].

The Robot Scientist relies on rule-based logic programming analytics to
automatically generate and test RDF-encoded hypotheses of the kind ‘gene G has
function A’ against RDF-encoded empirical data [H]. HyBrow works likewise, but
hypotheses are formulated by the user about biological events [15]. SWAN in
turn disfavors analytic techniques for hypothesis evaluation and focuses on
descriptive aspects: hypotheses are high-level natural language statements
retrieved from publications. Each ‘hypothesis’ is associated with lower-level
‘claims’ (both RDF-encoded) that are meant to support it on the basis of some
empirical evidence (RDF-encoded gene/protein data). In particular, SWAN [36]
differs from the former two in that each hypothesis is unstructured, being then
more related to efforts on the retrieval of textual claims from the narrative
fabric of scientific reports m-.

All of them, though, consist in some ad-hoc RDF encoding of sequence and
genome analysis hypotheses under varying levels of structure (viz., from ‘gene
G has function A’ statements to free text). Our framework in turn consists in
the U-relational encoding of hypotheses from mathematical equations, which is
(to our knowledge) the first work on hypothesis relational encoding.

Finally, as for hypothesis evaluation and comparison analytics, the T-DB
vision is distinguished in terms of its Bayesian inference approach. The latter
has been pointed out as a major direction for the improvement of the
Bioinformatics’ initiatives just mentioned (cf. [lH p. 13]), and is in fact an
influential model of decision making for hypothesis evaluation [15, p. 220].

2.7. Summary: Key Points 

We outline some key points in the T-DB vision: 

• ‘Structured deterministic hypotheses’ are encoded as theoretical data and 
distinguished from empirical data by the introduction of an epistemological 
dimension into their semantic structure. 
• Two sources of uncertainty are considered: theoretical uncertainty,
originating from competing hypotheses; and empirical uncertainty, derived from
alternative simulation trials on each hypothesis for the same phenomenon.

• A method to extract the structure of a hypothesis can be carefully designed
based on a hypothesis data representation and shall be reducible to a
machine-readable format for mathematical modeling, viz., W3C’s MathML,
which we shall adopt as a standard for hypothesis specification.

• We have seen that the controlled introduction of uncertainty into simulation
data is amenable to algorithm design and then reducible to a design-theoretic
synthesis method for the construction of U-relational DB’s.

• Simulation data can be modeled as hypothesis data whenever it is associated
with a target phenomenon. As the same phenomenon may happen
to be associated with many such hypotheses, the research activity can be
modeled as a data cleaning problem in p-DB’s.


Essentially, the vision of T-DB comprises a design-theoretic pipeline (Fig.
1.4). For the insertion of a hypothesis $k$, we shall be given a MathML-compliant
structure $S_k$ together with its simulation trial datasets $\mathcal{D}_k$ in raw files
(e.g., .mat, .csv). Then we apply an automatic Extract-Transform-Load (ETL)
procedure to generate the hypothesis ‘big’ fact table $H_k$ under the trial id’s.

The extracted equations are first encoded into fd’s. Then, at any time, once
many hypotheses have been inserted into the system, the uncertainty introduction
(U-intro) procedure can be applied to process the encoded fd’s and synthesize
the ‘uncertain’ U-relations that are to be eventually conditioned on observations.


Note, in Fig. 1.4, that the ETL procedure is operated in a ‘local’ view for
each hypothesis $k$, while the U-intro procedure and the conditioning are operated
in the ‘global’ view of all available hypotheses $k = 1..n$. The pipeline opens up
four main tracks of technical research challenges from the ETL stage on, viz., (i)
hypothesis encoding, (ii) causal reasoning over fd’s, (iii) p-DB synthesis and
(iv) conditioning. In the sequel we address the first three tracks of challenges in
depth. The problem of conditioning is outlined for further work in





Chapter 3 


Hypothesis Encoding 


In this chapter we address the problem of hypothesis encoding. In §3.1 we
introduce notation and basic concepts of structural equations and the problem of
causal ordering. In §3.2 we study the problem of extracting the causal ordering
implicit in the structure of a deterministic hypothesis and show that Simon’s
classical approach [21, 18] is intractable. In §3.3 we then build upon a less
well-known approach by Nayak [19] and borrow an efficient algorithm for it that
fits our use case for hypothesis encoding very well. In §3.4 we develop an
encoding scheme that builds upon the idea of structural equations through an
original abstraction of hypotheses ‘as data.’ In §3.5 we present experiments that
show how the encoding scheme works in practice for large hypotheses. In §3.6 we
discuss related work. In §3.7 we summarize the results of this chapter.


3.1. Preliminaries: Structural Equations 

Given a system of mathematical equations involving a set of variables, to build a
structural equation model (SEM) is, essentially, to establish a one-to-one mapping
between equations and variables [21]. That shall enable further detecting the
hidden asymmetry between variables, i.e., their causal ordering. For instance,
Einstein’s famous equation $E = mc^2$ states the equivalence of mass and energy,
summarizing a theory to which two different asymmetries can be imputed (for
different applications): say, given a fixed amount of mass $m = m_0$ (and recall
$c$ is a constant), predict the particle’s relativistic rest energy $E$; or, given
the particle’s rest energy, predict its mass or potential for nuclear fission.
Figure 3.1. “Directed causal graphs” associated with the two systems.


To stress the point, consider Newton’s second law $F = ma$ in such a scalar
setting. The modeler can either use it to compute (predict), say, acceleration
values given an amount of mass and different force intensities, or to predict force
intensities given a fixed acceleration (e.g., for testing an engineered dynamometer).
The point here is that Newton’s equation alone is not enough to derive predictions.
That is, it has a number of variables $|\mathcal{V}| = 3$, which is larger than its number
of equations $|\mathcal{E}| = 1$. It must be completed with two more equations in order to
qualify as an (applied) hypothesis ‘as data.’ Although it is usually interpreted
with an asymmetry towards $a$, technically there is nothing in its semantics to
suggest so. Compare the two systems given in Fig. 3.1. In sum, the causal ordering
of any system of equations is not to be guessed, as it can be inferred. In this
chapter we rely on previous work (mostly AI’s work, viz., [21, 19, 18]) and adapt
it for the encoding of hypotheses into fd’s.


Def. 1 A structure is a pair $S(\mathcal{E}, \mathcal{V})$, where $\mathcal{E}$ is a set of equations over
a set $\mathcal{V}$ of variables, $|\mathcal{E}| \leq |\mathcal{V}|$, such that:

(a) In any subset of $k$ equations of the structure, at least $k$ different variables
appear;

(b) In any subset of $k$ equations in which $r$ variables appear, $k \leq r$, if the
values of any $(r - k)$ variables are chosen arbitrarily, then the values of the
remaining $k$ variables can be determined uniquely; finding these unique
values is a matter of solving the equations.


^ As the equality construct ‘=’ is used as a predicate, not an assignment operator.
^ We shall introduce the notion of ‘directed causal graphs’ shortly.

Def. 2 Let $S(\mathcal{E}, \mathcal{V})$ be a structure. We say that $S$ is self-contained or
complete if $|\mathcal{E}| = |\mathcal{V}|$.

In short, we are interested in systems of equations that are ‘structural’ (Def. 1)
and ‘complete’ (Def. 2), viz., that have as many equations as variables and in
which no subset of equations has fewer variables than equations.

Complete structures can be solved for unique sets of values of their variables.
In this work, however, we are not concerned with solving sets of mathematical
equations at all, but with processing their causal ordering in view of U-relational
DB design. Simon’s concept of causal ordering has its roots in econometrics studies
(cf. [23]) and has been taken further in AI with a flavor of Graphical Models (GMs)
[50, 25, 18]. In this thesis we translate the problem of causal ordering into the
language of data dependencies, viz., into fd’s.

Def. 3 Let $S$ be a structure. We say that $S$ is minimal if it is complete and there
is no complete structure $S' \subset S$.

Def. 4 The structure matrix $A_S$ of a structure $S(\mathcal{E}, \mathcal{V})$, with
$f_1, f_2, \ldots, f_n \in \mathcal{E}$ and $x_1, x_2, \ldots, x_m \in \mathcal{V}$, is an $n \times m$ matrix
of 1’s and 0’s in which entry $a_{ij}$ is non-zero if variable $x_j$ appears in
equation $f_i$, and zero otherwise.
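
As a minimal sketch of Def. 4, the structure matrix can be built from a structure given as a mapping from equations to the variables appearing in them. The dictionary representation, the function name `structure_matrix` and the toy structure below are assumptions made for illustration only; the toy structure is chosen to be consistent with the fd set shown later for Example 3, since the matrix of Fig. 3.2a is not reproduced in the text.

```python
def structure_matrix(structure, variables):
    """Return the 0/1 structure matrix: one row per equation, one column per variable."""
    return [[1 if x in vars_of_f else 0 for x in variables]
            for vars_of_f in structure.values()]

# Hypothetical structure: equation name -> set of variables appearing in it.
S = {
    "f1": {"x1"},
    "f2": {"x2"},
    "f3": {"x3"},
    "f4": {"x1", "x2", "x3", "x4", "x5"},
    "f5": {"x1", "x3", "x4", "x5"},
    "f6": {"x4", "x6"},
    "f7": {"x5", "x7"},
}
V = sorted(set().union(*S.values()))   # column order x1..x7
A = structure_matrix(S, V)
is_complete = len(S) == len(V)         # Def. 2: as many equations as variables

for name, row in zip(S, A):
    print(name, row)
```

Only the incidence of variables in equations is recorded; coefficients and operators play no role in the causal ordering.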

Elementary row operations (e.g., row multiplication by a constant) on the 
structure matrix may hinder the structure’s causal ordering and then are not valid 
in general [23]. This also emphasizes that the problem of causal ordering is not 
about solving the system of mathematical equations of a structure, but identifying 
its hidden asymmetries. 

Def. 5 Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure. Then a total causal mapping over
$S$ is a bijection $\varphi: \mathcal{E} \to \mathcal{V}$ such that, for all $f \in \mathcal{E}$, if $\varphi(f) = x$
then $x \in Vars(f)$.

^ Also, we expect the systems of equations given as input to be ‘independent’ in the sense of 
Linear Algebra. In our context, that means systems that can only have non-redundant equations. 
In that case, if some subset of equations has fewer variables than equations, then the system must 
be ‘overconstrained.’ 
(a) Structure matrix as given. (b) COA execution in 3 recursive steps.


Figure 3.2. Running Simon’s Causal Ordering Algorithm (COA) on a given 
structure (Fig. 3.2a). Minimal subsets detected in a recursive step k, highlighted 
in different shades of gray, have their diagonal elements colored (Fig. 3.2b). 


Simon has informally described an algorithm (cf. [23]) that, given a complete
structure $S(\mathcal{E}, \mathcal{V})$, can be used to compute a partial causal mapping $\varphi_p$ from
partitions on the set of equations to same-cardinality partitions on the set of
variables. As shown by Dash and Druzdzel [18], the causal mapping returned by
Simon’s (so-called) Causal Ordering Algorithm (COA) is not total when $S$ has
variables that are strongly coupled (because they can only be determined
simultaneously). They have also shown that any total mapping $\varphi$ over $S$ must be
consistent with COA’s partial mapping $\varphi_p$ [18]. The latter is made partial by
design (it merges strongly coupled variables into partitions or clusters) in order
to force its induced causal graph to be acyclic. Algorithm 1, COA$_t$, is a variant
of Simon’s COA adapted to illustrate our use case. It returns a total causal
mapping $\varphi$, instead of a partial causal mapping. We illustrate it through
Example 3 and Fig. 3.2.


Example 3 Consider structure $S(\mathcal{E}, \mathcal{V})$ whose matrix is shown in Fig. 3.2a. Note
that $S$ is complete, since $|\mathcal{E}| = |\mathcal{V}| = 7$, but not minimal. The set of all minimal
subsets $S' \subset S$ is $S_c = \{ \{f_1\}, \{f_2\}, \{f_3\} \}$. By eliminating the variables
identified at recursive step $k$, a smaller structure $T \subset S$ is derived. Compare
the partial causal mapping eventually returned by COA,
$\varphi_p = \{ (\{f_1\}, \{x_1\}), (\{f_2\}, \{x_2\}), (\{f_3\}, \{x_3\}), (\{f_4, f_5\}, \{x_4, x_5\}), (\{f_6\}, \{x_6\}), (\{f_7\}, \{x_7\}) \}$,
to the total causal mapping returned by COA$_t$,
$\varphi = \{ (f_1, x_1), (f_2, x_2), (f_3, x_3), (f_4, x_4), (f_5, x_5), (f_6, x_6), (f_7, x_7) \}$.
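Algorithm 1 COA$_t$ as a variant of Simon’s COA.

 1: procedure COA$_t$($S$: structure over $\mathcal{E}$ and $\mathcal{V}$)
Require: $S$ given is complete, i.e., $|\mathcal{E}| = |\mathcal{V}|$
Ensure: Returns total causal mapping $\varphi: \mathcal{E} \to \mathcal{V}$
 2:   $\varphi \leftarrow \emptyset$, $S_c \leftarrow \emptyset$
 3:   for all minimal $S' \subset S$ do
 4:     $S_c \leftarrow S_c \cup S'$          ▷ store minimal structures $S'$ found in $S$
 5:     $V' \leftarrow S'(\mathcal{V})$
 6:     for all $f \in S'(\mathcal{E})$ do
 7:       $x \leftarrow$ any $x_a \in V'$
 8:       $\varphi \leftarrow \varphi \cup (f, x)$
 9:       $V' \leftarrow V' \setminus \{x\}$
10:   $T \leftarrow S \setminus \bigcup_{S' \in S_c} S'$
11:   if $T \neq \emptyset$ then
12:     return $\varphi \cup$ COA$_t$($T$)
13:   return $\varphi$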


Since $x_4$ and $x_5$ are strongly coupled (see Fig. 3.2b), COA$_t$ maps them arbitrarily
(e.g., it could be $f_4 \mapsto x_5$, $f_5 \mapsto x_4$ instead). Such a total mapping $\varphi$
renders a cycle in the directed causal graph induced by $\varphi$ (see Fig. 3.3). □
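
The recursive steps of Example 3 can be reproduced with a brute-force sketch of COA's partial mapping. It is not Simon's original formulation: the exhaustive enumeration of equation subsets, the function names and the dictionary form of the structure (inferred from the fd set given later for Example 3) are all assumptions made for illustration. The enumeration visits up to $2^{|\mathcal{E}|}$ subsets per step, so it is only meant for toy structures.

```python
from itertools import combinations

def minimal_complete_subsets(structure):
    """Brute-force search for all minimal self-contained subsets of equations."""
    found = []
    names = sorted(structure)
    for k in range(1, len(names) + 1):          # sizes in increasing order
        for combo in combinations(names, k):
            eqs = frozenset(combo)
            if any(prev < eqs for prev, _ in found):
                continue                        # contains a smaller complete subset
            variables = frozenset().union(*(structure[f] for f in combo))
            if len(variables) == k:             # k equations over exactly k variables
                found.append((eqs, variables))
    return found

def coa(structure):
    """Return the partial causal mapping as a list of levels (recursive steps)."""
    levels = []
    remaining = {f: set(xs) for f, xs in structure.items()}
    while remaining:
        level = minimal_complete_subsets(remaining)
        if not level:
            raise ValueError("structure is not complete")
        levels.append(level)
        solved_eqs = set().union(*(eqs for eqs, _ in level))
        solved_vars = set().union(*(xs for _, xs in level))
        # Eliminate the equations and variables determined at this step.
        remaining = {f: xs - solved_vars
                     for f, xs in remaining.items() if f not in solved_eqs}
    return levels

# Hypothetical structure consistent with Example 3.
S = {
    "f1": {"x1"}, "f2": {"x2"}, "f3": {"x3"},
    "f4": {"x1", "x2", "x3", "x4", "x5"},
    "f5": {"x1", "x3", "x4", "x5"},
    "f6": {"x4", "x6"}, "f7": {"x5", "x7"},
}
levels = coa(S)
for step, level in enumerate(levels, start=1):
    print(step, [(sorted(e), sorted(v)) for e, v in level])
```

On this structure the strongly coupled pair $\{f_4, f_5\}$ is kept as a single cluster at the second step, which is what makes the mapping partial.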



Figure 3.3. Directed causal graph induced by mapping $\varphi$ for structure $S$. An
edge connects a node $x_i$ towards a node $x_j$, with $x_i, x_j \in \mathcal{V}$, iff $x_i$ appears
in the equation $f \in \mathcal{E}$ such that $\varphi(f) = x_j$.


3.2. The Problem of Causal Ordering 

The serious issue with Alg. 1, COA$_t$, is that finding all minimal structures
in a given structure (cf. line 3) is a hard problem that can only be addressed
heuristically as a problem of co-clustering (also called biclustering [51, 52]) in
Boolean matrices. Simon’s approach, however, as we shall see next, is not the only
way to cope with the problem of causal ordering.

In fact, in order to study the computational properties of SEM’s and the
problem of causal ordering, we observe that any structure $S(\mathcal{E}, \mathcal{V})$ satisfying
Def. 1 can be modeled straightforwardly as a bipartite graph $G = (V_1 \cup V_2, E)$,
where the set $\mathcal{E}$ of equations and the set $\mathcal{V}$ of variables are the disjoint vertex
sets, i.e., $V_1 \mapsto \mathcal{E}$, $V_2 \mapsto \mathcal{V}$, and $E \mapsto S$ is the edge set connecting equations
to the variables appearing in them. Fig. 3.4 shows the bipartite graph $G$
corresponding to the structure given in Example 3 (for a comprehensive text on
graph concepts and related algorithmic problems, cf. Even [53]).



A biclique (or complete bipartite graph) is a bipartite graph $G = (A \cup B, E)$
such that for every two vertices $a \in A$, $b \in B$, we have $(a, b) \in E$ [53]. Note that
for balanced bicliques, i.e., when $|A| = |B| = K$, the degree $deg(u)$ of any vertex
$u \in A \cup B$ must be $deg(u) = |A| = |B| = K$.

Recent approaches to co-clustering problems (e.g., [M]) have come with the
notion of pseudo-biclique (also called ‘quasi-biclique’), which is a relaxation of the
biclique concept to allow some less rigid notion of connectivity than the ‘complete
connectivity’ required in a biclique. Now, recall that Simon’s COA$_t$ needs to
find, at each recursive step, all minimal subsets $S' \subset S$. Theorem 1 situates this
particular computational task in terms of its complexity, which for
$|\mathcal{E}'| = |\mathcal{V}'| \geq 2$ is equivalent to finding, at each recursive step, the
minimal-size ‘pseudo-bicliques’ (i.e., with the least $K \geq 2$) in its corresponding
bipartite graph (e.g., see Fig. 3.4). Here we take, in Def. 6, a specific notion of
pseudo-biclique.


Def. 6 Let $G = (A \cup B, E)$ be a bipartite graph. We say that $G$ is a $K$-balanced
pseudo-biclique if $|A| = |B| = K$ with $|E| \geq 2K$ and, for all vertices
$u \in A \cup B$, $deg(u) \geq 2$.
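
A short sketch may help to contrast Def. 6 with the biclique definition. The function names and the edge-list representation are assumptions; the sample graph takes the three equations $f_1, f_2, f_3$ over $x_1, x_2, x_3$ from the first three rows of the structure matrix of Fig. 3.5.

```python
def is_balanced_pseudo_biclique(A, B, edges):
    """Check Def. 6 on the bipartite graph (A ∪ B, edges), with edges ⊆ A × B."""
    K = len(A)
    if len(B) != K or len(edges) < 2 * K:
        return False
    degree = {u: 0 for u in A | B}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return all(d >= 2 for d in degree.values())

def is_balanced_biclique(A, B, edges):
    """A balanced biclique has |A| = |B| and every possible edge between A and B."""
    return len(A) == len(B) and set(edges) == {(a, b) for a in A for b in B}

A = {"f1", "f2", "f3"}
B = {"x1", "x2", "x3"}
E = [("f1", "x1"), ("f1", "x3"),
     ("f2", "x1"), ("f2", "x2"),
     ("f3", "x2"), ("f3", "x3")]

pseudo = is_balanced_pseudo_biclique(A, B, E)   # |E| = 6 = 2K and all degrees are 2
full = is_balanced_biclique(A, B, E)            # a biclique would need K^2 = 9 edges
print(pseudo, full)
```

The three equations form a 3-balanced pseudo-biclique without being a biclique, which is the kind of strongly coupled cluster COA has to detect.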


Now we state (originally) the balanced pseudo-biclique problem (BPBP) as a
decision problem as follows.


(BPBP). Given a bipartite graph $G = (V_1 \cup V_2, E)$ and a positive integer
$K \geq 2$, does $G$ contain a $K$-balanced pseudo-biclique?


Lemma 1 The balanced pseudo-biclique problem (BPBP) is NP-Complete. 


Proof 1 We show (by restriction [55]) that the BPBP is a generalization of the
balanced biclique problem (BBP), referred to as the ‘balanced complete bipartite
subgraph’ problem [55, GT24, p. 196], which is shown to be NP-Complete by means of
a transformation from ‘clique’ [Ml P- 446]. The restriction from BPBP to BBP
(special case) is made by requiring (cf. Def. 6) either (a) $|E| = K^2$ or (b)
$deg(u) = K$, for $K \geq 2$, which are equivalent ways of enforcing the inquired
$K$-balanced pseudo-biclique to be a $K$-balanced biclique. □


We introduce another hypothesis structure (see Fig. 3.5) to illustrate the
correspondence between the pseudo-biclique property and COA’s algorithmic approach
as elaborated in the proof of Theorem 1.


Theorem 1 Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure. Then the extraction of its
causal ordering by Simon’s COA($S$) is NP-Hard.


Proof 2 We show that, at each recursive step $k$ of COA, to find all non-trivial
minimal subsets (i.e., $|\mathcal{E}'| \geq 2$) translates into an optimization problem
associated with the decision problem BPBP, which we know by Lemma 1 to be
NP-Complete. See Appendix §A.1.1. □


Note that clearly, for any positive integer $K \geq 2$, we have $K^2 \geq 2K$.
      x1  x2  x3  x4
f1     1   0   1   0
f2     1   1   0   0
f3     0   1   1   0
f4     1   1   1   0

(a) COA execution in 2 recursive steps.


Figure 3.5. Another hypothesis structure example.


Nonetheless, the problem of causal ordering can be solved efficiently by means
of a different, less well-known approach due to Nayak [19], which we introduce and
build upon next.


3.3. Total Causal Mappings 

The problem of causal ordering can be solved in polynomial time by (i)
finding any total causal mapping $\varphi: \mathcal{E} \to \mathcal{V}$ over the structure $S$ given
(cf. Def. 5); and then (ii) by computing the transitive closure $C_\varphi^+$ of the set
$C_\varphi$ (cf. Eq. 3.1) of direct causal dependencies induced by $\varphi$.


$C_\varphi = \{\, (x_a, x_b) \mid \text{there exists } f \in \mathcal{E} \text{ such that } \varphi(f) = x_b \text{ and } x_a \in Vars(f) \,\}$    (3.1)

Def. 7 Let $S(\mathcal{E}, \mathcal{V})$ be a structure with variables $x_a, x_b \in \mathcal{V}$, and $\varphi$ a total
causal mapping over $S$ inducing set $C_\varphi$ of direct causal dependencies and its
transitive closure $C_\varphi^+$. We say that $(x_b, x_a)$ is a direct causal dependency in
$S$ if $(x_b, x_a) \in C_\varphi$, and that $(x_b, x_a)$ is a causal dependency in $S$ if
$(x_b, x_a) \in C_\varphi^+$.

In other words, $(x_a, x_b)$ is in $C_\varphi$ iff $x_b$ directly and causally depends on
$x_a$, given the causal asymmetries induced by $\varphi$. Those notions open up an
approach to causal reasoning that fits our use case very well, which is aimed at
encoding hypothesis structures into fd sets and then performing (symbolic) causal
reasoning in terms of acyclic pseudo-transitive reasoning over fd’s (cf. Chapter 4).
For it to be effective, nonetheless, we shall need to ensure some properties of
total causal mappings first.


® Note that it differs from AI research (e.g., [48]) geared for reasoning over GM’s. 
For a given structure $S$, there may be multiple total causal mappings over
$S$ (recall Example 3). But the causal ordering of $S$ must be unique (see Fig. 3.3).
Therefore, a question that arises is whether the transitive closure $C_\varphi^+$ is the
same for any total causal mapping $\varphi$ over $S$. Proposition 1, originally from
Nayak [19], ensures that is the case.


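Proposition 1 Let $S(\mathcal{E}, \mathcal{V})$ be a structure, and $\varphi_1: \mathcal{E} \to \mathcal{V}$ and
$\varphi_2: \mathcal{E} \to \mathcal{V}$ be any two total causal mappings over $S$. Then
$C_{\varphi_1}^+ = C_{\varphi_2}^+$.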


Proof 3 The proof is based on an argument from Nayak [19], which we present in
an arguably much clearer way (see Appendix, §A.1.2). Intuitively, it shows that if
$\varphi_1$ and $\varphi_2$ differ on the variable an equation $f$ is mapped to, then such
variables, viz., $\varphi_1(f)$ and $\varphi_2(f)$, must be causally dependent on each other
(strongly coupled). □
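
Proposition 1 can be checked on the running example with a small sketch of Eq. 3.1 and its transitive closure. The function names, the dictionary form of the structure (inferred from the fd set given later for Example 3) and the closure-by-graph-search strategy are assumptions for illustration; Eq. 3.1 is followed literally, so pairs of the form $(x, x)$ are included.

```python
def direct_dependencies(structure, mapping):
    """Eq. 3.1: pairs (xa, xb) such that xa appears in the equation mapped to xb."""
    return {(xa, xb) for f, xb in mapping.items() for xa in structure[f]}

def transitive_closure(pairs):
    """Closure of a binary relation, by a depth-first search from each source node."""
    succ = {}
    for a, b in pairs:
        succ.setdefault(a, set()).add(b)
    closure = set()
    for start in succ:
        seen, stack = set(), list(succ[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        closure |= {(start, node) for node in seen}
    return closure

# Hypothetical structure consistent with Example 3.
S = {
    "f1": {"x1"}, "f2": {"x2"}, "f3": {"x3"},
    "f4": {"x1", "x2", "x3", "x4", "x5"},
    "f5": {"x1", "x3", "x4", "x5"},
    "f6": {"x4", "x6"}, "f7": {"x5", "x7"},
}
# Two total causal mappings that differ only on the strongly coupled x4, x5.
phi1 = {"f1": "x1", "f2": "x2", "f3": "x3", "f4": "x4",
        "f5": "x5", "f6": "x6", "f7": "x7"}
phi2 = dict(phi1, f4="x5", f5="x4")

c1 = direct_dependencies(S, phi1)
c2 = direct_dependencies(S, phi2)
same_direct = c1 == c2
same_closure = transitive_closure(c1) == transitive_closure(c2)
print(same_direct, same_closure)
```

The two mappings induce different sets of direct dependencies (only `phi2` makes $x_5$ directly dependent on $x_2$), yet their closures coincide, as the proposition states.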


Another issue is concerned with the precise conditions under which total
causal mappings exist (i.e., whether or not all variables in the equations can be
causally determined). In fact, by Proposition 2, based on Nayak [19] apud Hall
[53, p. 135-7], we know that the existence condition holds iff the given structure
is complete.

Before proceeding to it, let us refer to Even [53] to briefly introduce the
additional graph-theoretic concepts which are necessary here. A matching in a
graph is a subset of edges such that no two edges in the matching share a common
node. A matching is said to be maximum if no other matching in the graph has
more edges (in particular, no edge can be added to it without violating the
matching property). Finally, a matching in a graph is said to be ‘perfect’ if
every vertex is an end-point of some edge in the matching; in a bipartite graph,
a perfect matching is said to be a complete matching.


Proposition 2 Let $S(\mathcal{E}, \mathcal{V})$ be a structure. Then a total causal mapping
$\varphi: \mathcal{E} \to \mathcal{V}$ over $S$ exists iff $S$ is complete.


Proof 4 We observe that a total causal mapping $\varphi: \mathcal{E} \to \mathcal{V}$ over $S$ corresponds
exactly to a complete matching $M$ in a bipartite graph $B = (V_1 \cup V_2, E)$, where
$V_1 \mapsto \mathcal{E}$, $V_2 \mapsto \mathcal{V}$, and $E \mapsto S$. In fact, by Even apud Hall’s theorem
(cf. [53, p. 135-7]), we know that $B$ has a complete matching iff (a) for every
subset of vertices $F \subseteq V_1$, we have $|F| \leq |E(F)|$, where $E(F)$ is the set of all
vertices connected to the vertices in $F$ by edges in $E$; and (b) $|V_1| = |V_2|$. By
Def. 1 (no subset of equations has fewer variables than equations), and Def. 2
(number of equations is the same as number of variables), it is easy to see that
conditions (a) and (b) above hold iff $S$ is a complete structure. □

The problem of finding a maximum matching is a well-studied algorithmic
problem. In this thesis we adopt the Hopcroft-Karp algorithm [57], which is known
to be polynomial-time, bounded by $O(\sqrt{|V_1| + |V_2|}\,|E|)$. That is, we handle
the problem of total causal mapping by (see Alg. 2) translating it to the problem
of maximum matching in a bipartite graph (in linear time), then applying the
Hopcroft-Karp algorithm to get the matching, and finally translating it back to
the total causal mapping, as suggested by the proof of Proposition 2.

Algorithm 2 Find a total causal mapping for a given structure.

 1: procedure TCM($S$: structure over $\mathcal{E}$ and $\mathcal{V}$)
Require: $S$ given is a complete structure, i.e., $|\mathcal{E}| = |\mathcal{V}|$
Ensure: Returns a total causal mapping $\varphi$
 2:   $B(V_1 \cup V_2, E) \leftarrow \emptyset$
 3:   $\varphi \leftarrow \emptyset$
 4:   for all $(f, X) \in S$ do          ▷ translates the structure $S$ to a bipartite graph $B$
 5:     $V_1 \leftarrow V_1 \cup \{f\}$
 6:     for all $x \in X$ do
 7:       $V_2 \leftarrow V_2 \cup \{x\}$
 8:       $E \leftarrow E \cup \{(f, x)\}$
 9:   $M \leftarrow$ Hopcroft-Karp($B$)      ▷ solves the maximum matching problem
10:   for all $(f, x) \in M$ do          ▷ translates the matching to a total causal mapping
11:     $\varphi \leftarrow \varphi \cup \{(f, x)\}$
12:   return $\varphi$


Fig. 3.6 shows the complete matching found by the Hopcroft-Karp algorithm for
the structure given in Example 3.
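
The translation between structures and matchings can be sketched in a few lines. To keep the example short and dependency-free, the sketch below swaps in the simpler augmenting-path method for bipartite matching, which runs in $O(|\mathcal{E}|\,|S|)$ rather than the Hopcroft-Karp bound; it is therefore not the algorithm adopted in this thesis, only an illustration of the same reduction. The function name `tcm` and the dictionary form of the structure are assumptions.

```python
def tcm(structure):
    """Total causal mapping via bipartite matching (simple augmenting paths)."""
    equation_of = {}                      # variable -> equation currently matched to it

    def try_assign(f, visited):
        """Try to match equation f, re-routing earlier matches if necessary."""
        for x in sorted(structure[f]):
            if x in visited:
                continue
            visited.add(x)
            if x not in equation_of or try_assign(equation_of[x], visited):
                equation_of[x] = f
                return True
        return False

    for f in structure:
        if not try_assign(f, set()):
            raise ValueError("no total causal mapping: structure is not complete")
    return {f: x for x, f in equation_of.items()}

# Hypothetical structure consistent with Example 3.
S = {
    "f1": {"x1"}, "f2": {"x2"}, "f3": {"x3"},
    "f4": {"x1", "x2", "x3", "x4", "x5"},
    "f5": {"x1", "x3", "x4", "x5"},
    "f6": {"x4", "x6"}, "f7": {"x5", "x7"},
}
phi = tcm(S)
print(sorted(phi.items()))
```

Which of $x_4$, $x_5$ is assigned to $f_4$ depends on the traversal order; by Proposition 1 either choice leads to the same causal ordering.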

® The Hopcroft-Karp algorithm solves maximum matching in a bipartite graph efficiently as
a problem of finding maximum flow in a network (cf. [53, p. 135-7], or [58, p. 664-9]).

Figure 3.6. Complete matching $M$ for structure $S$ from Example 3.


Corollary 1 summarizes the results we have so far.

Corollary 1 Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure. Then a total causal mapping
$\varphi: \mathcal{E} \to \mathcal{V}$ over $S$ can be found by (Alg. 2) TCM in time that is bounded by
$O(\sqrt{|\mathcal{E}|}\,|S|)$.

Proof 5 Let $B = (V_1 \cup V_2, E)$ be the bipartite graph corresponding to the complete
structure $S$ given to TCM, where $V_1 \mapsto \mathcal{E}$, $V_2 \mapsto \mathcal{V}$, and $E \mapsto S$. The
translation of $S$ into $B$ is done by a scan over it. This scan is of length
$|S| = |E|$. Note that the number $|E|$ of edges rendered is precisely the length
$|S|$ of the structure, where the denser the structure, the greater $|S|$ is. The
re-translation of the matching computed by the internal procedure Hopcroft-Karp,
in turn, is done at the expense of $|\mathcal{E}| = |\mathcal{V}| \leq |S|$. Thus, it is easy to see
that TCM is dominated by the maximum matching algorithm Hopcroft-Karp, which is
known to be $O(\sqrt{|V_1| + |V_2|}\,|E|)$, i.e., $O(\sqrt{|\mathcal{E}| + |\mathcal{V}|}\,|S|)$. Since $S$ is
assumed complete, we have $|\mathcal{E}| = |\mathcal{V}|$ and then
$\sqrt{|\mathcal{E}| + |\mathcal{V}|} = \sqrt{2}\,\sqrt{|\mathcal{E}|}$. Therefore, TCM must have running time at
most $O(\sqrt{|\mathcal{E}|}\,|S|)$. □

Remark 2 Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure. Then we know (cf. Proposition 2)
that a total causal mapping over $S$ exists. Let it be defined $\varphi = $ TCM($S$). Then
the causal ordering implicit in $S$ shall be correctly extracted (cf. Proposition 1)
by processing the causal dependencies induced by $\varphi$, as we show in Chapter 4. □

Now we are ready to accomplish the hypothesis encoding into fd’s, as we
show next.

3.4. The Encoding Scheme

We shall encode variables as relational attributes and map equations onto
fd’s through total causal mappings. Let $Z$ be a set of attribute symbols such that
$Z \simeq \mathcal{V}$, where $S(\mathcal{E}, \mathcal{V})$ is a complete structure; and let $\phi, \upsilon \notin Z$ be two
special attribute symbols kept to identify (resp.) phenomena and hypotheses. We are
explicitly distinguishing symbols in $Z$, assigned by the user into structure $S$,
from epistemological symbols $\phi$ and $\upsilon$. Then we consider an insight of Simon’s
into the nature of scientific modeling and interventions [21], summarized in Def. 8.

Def. 8 Let $S(\mathcal{E}, \mathcal{V})$ be a structure and $x \in \mathcal{V}$ be a variable. We say that $x$
is exogenous if there exists an equation $f \in \mathcal{E}$ such that $Vars(f) = \{x\}$. In
this case $f$ can be written $f(x) = 0$, and must be mapped to $x$ in any total causal
mapping $\varphi$ over $S$. We say that $x$ is endogenous otherwise.

Remark 3 introduces an interpretation of Def. 8 with a data dependency flavor.

Remark 3 The values of exogenous variables (attributes) are to be determined
empirically, outside of the system (proposed structure $S$). Such values are,
therefore, dependent on the phenomenon id $\phi$ only. The values of endogenous
variables (attributes) are in turn to be determined theoretically, within the
system. They are dependent on the hypothesis id $\upsilon$ and shall be indirectly
dependent on the phenomenon id $\phi$ as well. □
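
Def. 8 translates directly into a test on the number of variables of each equation. The following is a small sketch under the assumption that a structure is given as a mapping from equation names to variable sets; the function name and the sample structure (consistent with Example 3) are illustrative only.

```python
def classify_variables(structure):
    """Split variables into (exogenous, endogenous) following Def. 8."""
    # x is exogenous iff some equation f has Vars(f) = {x}.
    exogenous = {next(iter(xs)) for xs in structure.values() if len(xs) == 1}
    all_vars = set().union(*structure.values())
    return exogenous, all_vars - exogenous

# Hypothetical structure consistent with Example 3.
S = {
    "f1": {"x1"}, "f2": {"x2"}, "f3": {"x3"},
    "f4": {"x1", "x2", "x3", "x4", "x5"},
    "f5": {"x1", "x3", "x4", "x5"},
    "f6": {"x4", "x6"}, "f7": {"x5", "x7"},
}
exogenous, endogenous = classify_variables(S)
print(sorted(exogenous), sorted(endogenous))
```

In the sense of Remark 3, the first group would depend on the phenomenon id only, and the second on the hypothesis id.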

As introduced in §2.2, the encoding scheme we are presenting here is not
obvious. It goes beyond Simon’s structural equations to abstract the data-level
semantics of mathematical deterministic hypotheses. Whereas Simon’s structural
equations are able to represent only linear equations, our encoding scheme can
represent non-linear equations and arbitrarily complex mathematical operators by
means of its data representation of deterministic hypotheses.

For instance, take the non-linear equation $y = ax^2$ and suppose that, considering
the context of its complete system of equations, (Alg. 2) TCM maps it onto variable
$y$. Then, by an abstraction of the equation semantics, we shall encode it into fd
$a\,x\,\upsilon \to y$. That is, the hypothesis identifier $\upsilon$ captures the data-level
semantics of the hypothesis equation.


$\Sigma = \{\; \phi \to x_1,\;\; \phi \to x_2,\;\; \phi \to x_3,\;\; x_1\,x_2\,x_3\,x_5\,\upsilon \to x_4,\;\; x_1\,x_3\,x_4\,\upsilon \to x_5,\;\; x_4\,\upsilon \to x_6,\;\; x_5\,\upsilon \to x_7 \;\}$.


Figure 3.7. Encoded fd set $\Sigma$ (cf. Alg. 3) for the structure from Example 3.

We encode complete structures into fd sets by means of (Alg. 3) h-encode.
Fig. 4.1 presents an fd set defined $\Sigma = $ h-encode($S$), encoding the same
structure $S$ from Example 3.


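Algorithm 3 Hypothesis encoding.

 1: procedure H-ENCODE($S$: structure over $\mathcal{E}$ and $\mathcal{V}$, $\mathcal{D}$: domain variables)
Require: $S$ given is a complete structure, i.e., $|\mathcal{E}| = |\mathcal{V}|$
Ensure: Returns a non-redundant fd set $\Sigma$
 2:   $\Sigma \leftarrow \emptyset$
 3:   $\varphi_t \leftarrow$ TCM($S$)
 4:   for all $(f, x) \in \varphi_t$ do
 5:     $Z \leftarrow X \setminus \{x\}$, where $(f, X) \in S$
 6:     if $Z = \emptyset$ or $Z \subseteq \mathcal{D}$ then          ▷ $x$ is exogenous
 7:       if $x \notin \mathcal{D}$ then          ▷ suppress $\phi$-fd for dimensions like time $t$
 8:         $\Sigma \leftarrow \Sigma \cup (Z \cup \{\phi\}, x)$
 9:     else          ▷ $x$ is endogenous
10:       $\Sigma \leftarrow \Sigma \cup (Z \cup \{\upsilon\}, x)$
11:   return $\Sigma$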
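
The encoding step can be sketched as follows. The sketch takes a total causal mapping as an input parameter rather than computing it, and represents each fd as a pair (left-hand side, right-hand attribute); these choices, the names `PHI` and `UPSILON` standing for the phenomenon and hypothesis identifiers, and the dictionary form of the structure from Example 3 are assumptions made for illustration.

```python
PHI, UPSILON = "phi", "upsilon"   # phenomenon id and hypothesis id attributes

def h_encode(structure, mapping, domain_vars=frozenset()):
    """Encode a structure into a set of fd's (lhs, rhs), given a total causal mapping."""
    fds = set()
    for f, x in mapping.items():
        Z = frozenset(structure[f]) - {x}
        if not Z or Z <= domain_vars:          # x is exogenous
            if x not in domain_vars:           # no phi-fd for dimensions like time
                fds.add((Z | {PHI}, x))
        else:                                  # x is endogenous
            fds.add((Z | {UPSILON}, x))
    return fds

# Hypothetical structure consistent with Example 3, with its total causal mapping.
S = {
    "f1": {"x1"}, "f2": {"x2"}, "f3": {"x3"},
    "f4": {"x1", "x2", "x3", "x4", "x5"},
    "f5": {"x1", "x3", "x4", "x5"},
    "f6": {"x4", "x6"}, "f7": {"x5", "x7"},
}
phi = {"f1": "x1", "f2": "x2", "f3": "x3", "f4": "x4",
       "f5": "x5", "f6": "x6", "f7": "x7"}

sigma = h_encode(S, phi)
for lhs, rhs in sorted(sigma, key=lambda fd: fd[1]):
    print(" ".join(sorted(lhs)), "->", rhs)
```

With the mapping of Example 3 and no domain variables, the result coincides with the fd set of Fig. 3.7.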



Now we study the design-theoretic properties of the encoded fd sets. We
shall make use of the concept of ‘canonical’ fd sets (also called ‘minimal’
[20, p. 390]), see Def. 9.

Def. 9 Let $\Sigma$ be an fd set. We say that $\Sigma$ is canonical if:

(a) each fd in $\Sigma$ has the form $X \to A$, where $|A| = 1$;

(b) for no $(X, A) \in \Sigma$ we have $(\Sigma - \{(X, A)\})^+ = \Sigma^+$;

(c) for each fd $X \to A$ in $\Sigma$, there is no $Y \subset X$ such that
$(\Sigma \setminus \{X \to A\} \cup \{Y \to A\})^+ = \Sigma^+$.


^ Note that, without the hypothesis id, infinitely many equations fit the pattern $a\,x \to y$.


For an fd set satisfying such properties (Def. 9) individually, we say that it is (a)
singleton-rhs, (b) non-redundant and (c) left-reduced. It is said to have an attribute
A in X that is ‘extraneous’ w.r.t. E if it is not left-reduced (Def. |9}c) m P- 74]. 
Finally, an fd $X \to Y$ in $\Sigma$ is said to be trivial if $Y \subseteq X$. Note that the
presence of a trivial fd in an fd set is sufficient to make it redundant.
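
The properties of Def. 9 can be tested with the standard attribute-closure procedure. The sketch below is one conventional way of doing so, not a procedure taken from this thesis: an fd is redundant if its right-hand side follows from the remaining fd's, and an attribute is extraneous if the right-hand side still follows without it. The fd representation, function names, and the small counter-example `sigma2` are assumptions.

```python
def attr_closure(attrs, fds):
    """Closure of a set of attributes under singleton-rhs fd's (lhs, rhs)."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if rhs not in closure and lhs <= closure:
                closure.add(rhs)
                changed = True
    return closure

def is_nonredundant(fds):
    """Def. 9(b): no fd is implied by the others."""
    return all(rhs not in attr_closure(lhs, fds - {(lhs, rhs)}) for lhs, rhs in fds)

def is_left_reduced(fds):
    """Def. 9(c): no fd has an extraneous attribute on its left-hand side."""
    return all(rhs not in attr_closure(lhs - {a}, fds)
               for lhs, rhs in fds for a in lhs)

def fd(lhs, rhs):
    return (frozenset(lhs.split()), rhs)

# The fd set of Fig. 3.7, with 'phi' and 'upsilon' as the special attributes.
sigma = {fd("phi", "x1"), fd("phi", "x2"), fd("phi", "x3"),
         fd("x1 x2 x3 x5 upsilon", "x4"), fd("x1 x3 x4 upsilon", "x5"),
         fd("x4 upsilon", "x6"), fd("x5 upsilon", "x7")}

# Hypothetical fd set in which b is extraneous in 'a b -> c', since a -> b.
sigma2 = {fd("a", "b"), fd("a b", "c")}

print(is_nonredundant(sigma), is_left_reduced(sigma), is_left_reduced(sigma2))
```

This particular fd set happens to be left-reduced as well; the checks only illustrate the definitions and say nothing about encoded fd sets in general.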


Theorem 2 Let $\Sigma$ be an fd set defined $\Sigma = $ h-encode($S$) for some complete
structure $S$. Then $\Sigma$ is non-redundant and singleton-rhs but may not be
left-reduced (and then may not be canonical).


Proof 6 We show that properties (a-b) of Def. 9 must hold for $\Sigma$ produced by
(Alg. 3) h-encode, but property (c) may not hold (i.e., the encoded fd set $\Sigma$ may
not be left-reduced). See Appendix, §A.1.3. □


We draw attention to the significance of Theorem 2, as it sheds light on a
connection between Simon’s complete structures [23] and fd sets [20]. In fact, we
continue to elaborate on that connection in the next chapter to handle causal
ordering processing symbolically by causal reasoning over fd’s.


3.5. Experiments 


Fig. 3.8 shows the results of experiments we have carried out in order to study
how effective the procedure of hypothesis encoding is in practice, in particular
its behavior for hypotheses whose structure $S$ has been randomly generated over
orders of magnitude $|S| \approx 2^k$, to have length up to $|S| \approx 2^{20} \approx 1\mathrm{M}$. The
largest structure considered, with $|S| \approx 1\mathrm{M}$, has been generated to have
exactly $|\mathcal{E}| = 2.5\mathrm{K}$.

We have executed ten runs for each tested order of magnitude, and then
taken the mean running time in ms. The plot is shown in Fig. 3.8 in logscale
base 2. In fact, a near-quadratic, yet sub-quadratic, slope is expected for the
curve of structure length $|S|$ × time.

These scalability results are compatible with the computational complexity
of h-encode, which is (cf. Corollary 1) bounded by $O(\sqrt{|\mathcal{E}|}\,|S|)$.



Figure 3.8. Performance of hypothesis encoding (in logscale). 


3.6. Related Work 

Modeling physical and socio-economical systems as a set of equations is a
traditional modeling approach, and a very large bulk of models exists to date.
Simon’s early work on structural equations and causal ordering comprises a specific
notion of causality aimed at further contributing to the potential of such a
modeling approach (cf. [59]). It is meant for identifying influences among variables
(or their values) that are implicit in the system model, for enabling informed
interventions. These may apply either to the system (phenomenon) under study, or to
the model itself (say, when its predictions are not approximating observations very
well).

Significant research effort has been devoted to causal modeling and reasoning
in the past decades in both statistics and AI (cf. [25, 19]). The notion of causality
used can be traced back to the early work of Simon and others in Econometrics.

® The experiments were performed on a 2.3 GHz/4 GB Intel Core i5 running Mac OS X 10.6.8.

® Note that, for any arbitrary structure $S(\mathcal{E}, \mathcal{V})$, we have
$|\mathcal{E}| \leq |S| \leq |\mathcal{E}|^2$. So, in the worst case (the densest structure possible) we
have $|S| = |\mathcal{E}|^2$ and then can establish a time bound function of $|S|$ only, viz.,
$O(|S|\,\sqrt[4]{|S|})$.
Nonetheless, there are two important differences to be emphasized:

• Such work is mostly devoted to dealing with (statistical) qualitative
hypotheses, not (deterministic) quantitative hypotheses;

• The causal model is assumed as given or is derived from data, instead of
being converted or synthesized from a set of equations.

These are both core differences that also apply to our work in comparison
to the bulk of existing work in probabilistic DB’s. Our main point here, though,
is to clarify the technical context and state of the art of the problem of causal
ordering. Few works have been concerned with extracting a causal model out
of some previously existing formal specification such as a set of equations. This is
one reason why causal ordering is as yet a barely studied problem from the
computational point of view.

Dash and Druzdzel revisit the problem and re-motivate it in light of modern
applications [18]. First, they provide a formal description of how Simon’s COA
gives a summary of the causal dependencies implicit in a given SEM. That is,
in clustering the strongly coupled variables into a causal graph, COA provides a
condensed representation of the causal model implicit in the given SEM. They
then show that any valid total causal mapping produced for a given SEM must be
consistent with COA’s partial causal mapping.

Yet, the serious problem is that the algorithm turns out to be intractable. In
fact, no formal study of COA’s computational properties can yet be found in the
literature. In this thesis we have obtained the (negative) hardness result that it is
intractable, which turns out to be compatible with Nayak’s intuition (sic) that it
is a worst-case exponential time algorithm (cf. [60, p. 37]).

Inspired by Serrano and Gossard’s work on constraint modeling and reasoning
[61], Nayak reports an approach that is provably quite effective to process the
causal ordering: extract a total causal mapping and then compute the transitive
closure of the direct causal dependencies. In this thesis we build upon it to
perform causal reasoning in terms of a form of transitive reasoning. Such an
approach fits our use case very well, viz., the synthesis of p-DB’s from fd’s. As
we show in Chapter 4, we process the causal ordering of a hypothesis structure
(abstracted as a SEM) in terms of acyclic causal reasoning over fd’s and prove its
correctness. This is enabled by the encoding scheme presented in this chapter.


3.7. Summary of Results 

In this chapter we have studied and developed an encoding scheme to process 
the mathematical structure of a deterministic hypothesis into a set of fd’s towards 
the encoding of hypotheses ‘as data.’ Then we have studied the design-theoretic 
properties held by such an encoded fd set. We list the results achieved as follows. 


• We know (by an original hardness result proved in this chapter) that Simon’s
approach to process the causal ordering of a structure is intractable;

• By building upon the work of Simon [24] and Nayak [49] (cf. the propositions
of this chapter), we have framed an approach to efficiently extract the basic
information (a total causal mapping) for processing the causal ordering
implicit in the mathematical structure of a deterministic hypothesis;

• By a corollary of this chapter, we know how to process the complete structure of a
hypothesis into a total causal mapping in time that is bounded by O(√|ℰ| |S|).
That is, the machinery of hypothesis encoding is provably suitable for very
large hypothesis structures;

• By the theorem that studies the design-theoretic properties of the encoded
fd sets, we have unraveled the connection between Simon’s complete
structures and fd sets, to be further explored in the next chapter.


• We have performed experiments (cf. Fig. 3.8) to study how effective the
approach is in practice, or how it scales for hypotheses whose structure S
is randomly generated to have length up to the order of |S| <


The tests were up to this order only because of the hardware limitations of our experimental 
settings. In theory (cf. complexity time bounds), larger structures can be handled very efficiently. 





Chapter 4 


Causal Reasoning over FD’s 


In this chapter we present a technique to address the problem of causal ordering
processing in order to enable the synthesis of U-relational DB’s. In §4.1 we
introduce Armstrong’s classical inference system to reason over fd’s. In §4.2 we
develop the core concept and algorithm of the folding of an fd set, as a method for
acyclic causal reasoning over fd’s. In §4.3 we show its connections (equivalence)
with causal reasoning. In §4.4 we present experiments on how the method behaves
in practice. In §4.5 we discuss related work. In §4.6 we conclude the chapter.


4.1. Preliminaries: Armstrong’s Inference Rules 

Following the usual notational conventions from the DB literature [20], we write X, Y, Z
to denote sets of relational attributes and A, B, C to denote singleton attribute
sets. Also, we write XY as shorthand for X ∪ Y.

Functional dependency theory relies on Armstrong’s inference rules (or axioms)
of (R0) reflexivity, (R1) augmentation and (R2) transitivity, which form
a sound and complete inference system for reasoning over fd’s [20]. From R0-R2
one can derive additional rules, viz., (R3) decomposition, (R4) union and (R5)
pseudo-transitivity.

R0. If Y ⊆ X, then X → Y;

R1. If X → Y, then XZ → YZ;

R2. If X → Y and Y → W, then X → W;

R3. If X → YZ, then X → Y and X → Z;

R4. If X → Y and X → Z, then X → YZ;

R5. If X → Y and YZ → W, then XZ → W.

Given an fd set Σ, one can obtain Σ⁺, the closure of Σ, by a finite application of
rules R0-R5. We are concerned with reasoning over an fd set in order to process
its implicit causal ordering. The latter, as we shall see in §4.3, can be performed
in terms of (pseudo-)transitive reasoning. Note that R2 is a particular case of R5
when Z = ∅; hence we shall refer to R5 reasoning and understand R2 as included. The
notion of attribute closure opens up a way to compute Σ⁺ very efficiently.
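To make the relationship between R5 and the basic rules concrete, the following is the standard textbook derivation of pseudo-transitivity from augmentation and transitivity, written in the notation above. It is a worked illustration only; setting Z = ∅ in it collapses the argument to R2 itself.

```latex
\begin{align*}
  & X \to Y    && \text{(premise)} \\
  & XZ \to YZ  && \text{(R1: augment the premise with } Z\text{)} \\
  & YZ \to W   && \text{(premise)} \\
  & XZ \to W   && \text{(R2 over the two previous lines)}
\end{align*}
```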

Let Σ be an fd set on attributes U, with X ⊆ U. Then X⁺, the attribute
closure of X w.r.t. Σ, is the set of attributes A such that ⟨X, A⟩ ∈ Σ⁺. Bernstein
has long given an algorithm, (Alg. 4) XClosure, to compute X⁺. It is polynomial time
in |Σ| · |U| (cf. [62]), where Σ and U are (resp.) the given fd set and the attribute
set over which it is defined. A tighter time bound (linear time in |Σ| · |U|) is
achievable as discussed further in Remark 4.


Algorithm 4 Attribute closure X⁺ (cf. [20, p. 388]).

 1: procedure XClosure(Σ: fd set, X: attribute set)
Require: Σ is an fd set, X is a non-empty attribute set
Ensure: X⁺ is the attribute closure of X w.r.t. Σ
 2:   size ← 0
 3:   Δ ← ∅
 4:   X⁺ ← X
 5:   while size < |X⁺| do
 6:     size ← |X⁺|
 7:     Σ ← Σ \ Δ
 8:     for all ⟨Y, Z⟩ ∈ Σ do
 9:       if Y ⊆ X⁺ then
10:         Δ ← Δ ∪ {⟨Y, Z⟩}    ▷ consumes fd
11:         X⁺ ← X⁺ ∪ Z
12:   return X⁺
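The loop of Alg. 4 can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: fd's are modeled as pairs of frozensets, and the names `fd` and `x_closure` are hypothetical helpers introduced only for this example. It is the simple version, without the linear-time optimization mentioned in the text.

```python
from typing import FrozenSet, Iterable, Set, Tuple

FD = Tuple[FrozenSet[str], FrozenSet[str]]  # (lhs, rhs)


def fd(lhs: str, rhs: str) -> FD:
    """Hypothetical helper: build an fd from space-separated attribute names."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def x_closure(sigma: Iterable[FD], x: Iterable[str]) -> Set[str]:
    """Attribute closure of x w.r.t. sigma, following the loop of Alg. 4."""
    pending = set(sigma)           # fd's not yet consumed (Sigma \ Delta)
    closure = set(x)               # X+ <- X
    size = 0
    while size < len(closure):     # stop once a full pass adds nothing
        size = len(closure)
        for lhs, rhs in list(pending):
            if lhs <= closure:     # Y is a subset of X+
                pending.discard((lhs, rhs))  # consumes fd
                closure |= rhs
    return closure


sigma = [fd("A", "B"), fd("B", "C"), fd("C D", "E")]
print(sorted(x_closure(sigma, {"A"})))
print(sorted(x_closure(sigma, {"A", "D"})))
```

Each fd is consumed at most once, so the number of passes of the outer loop is bounded by the number of attributes that can be added.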



4.2. Acyclic Pseudo-Transitive Reasoning 

As discussed in the previous chapter, we shall process the causal ordering in terms
of computing the transitive closure of each endogenous variable (predictive
attribute). Before we proceed to that, we shall develop some machinery to reason
over fd’s in terms of Armstrong’s rule R5 (pseudo-transitivity). We shall then
demonstrate the correspondence of this kind of reasoning with causal reasoning
shortly in the sequel.

Def. 10 Let Σ be a set of fd’s on attributes U. Then Σ^▷, the pseudo-transitive
closure of Σ, is the minimal set Σ^▷ ⊇ Σ such that X → Y is in Σ^▷, with XY ⊆ U,
iff it can be derived from a finite (possibly empty) application of rule R5 over fd’s
in Σ. In that case, we may write X ▷→ Y and omit ‘w.r.t. Σ’ if it can be understood
from the context.

We are in fact interested in a very specific proper subset of Σ^▷, say, a kernel
of fd’s in Σ^▷ that gives a “compact” representation of the causal ordering implicit
in Σ. Note that, to characterize such a special subset, we shall need to be careful
w.r.t. the presence of cycles in the causal ordering.


Def. 11 Let Σ be a set of fd’s on attributes U, and ⟨X, A⟩ ∈ Σ^▷ with XA ⊆ U.
We say that X → A is folded (w.r.t. Σ), and write X ↬ A, if it is non-trivial
and for no Y ⊆ U with Y ⊉ X, we have Y → X and X ↛ Y in Σ⁺.


The intuition of Def. 11 is that an fd is folded when there is no sense in going
on with pseudo-transitive reasoning over it anymore (nothing new is discovered).
Given an fd X → A in fd set Σ, we shall be able to find some folded fd Z ↬ A
by applying (R5) pseudo-transitivity as much as possible while ruling out cyclic or
trivial fd’s in some clever way.


Def. 12 Let Σ be an fd set on attributes U, and ⟨X, A⟩ ∈ Σ be an fd with XA ⊆ U.
Then,

(a) A^↬, the (attribute) folding of A (w.r.t. Σ), is an attribute set Z ⊆ U such
that Z ↬ A;

(b) Accordingly, Σ^↬, the folding of Σ, is a proper subset Σ^↬ ⊂ Σ^▷ such that
an fd ⟨Z, A⟩ ∈ Σ^▷ is in Σ^↬ iff Z ↬ A for some Z ⊆ U.


Figure 4.1. Fd set Σ encoding the structure of Fig. 3.2a (left) and its
folding Σ^↬ derived by Alg. 5 (right).


Example 4 (continued). Fig. 4.1 shows an fd set Σ (left) and its folding Σ^↬
(right). Note that the folding can be obtained by computing the attribute folding
for A in each fd X → A in Σ. We illustrate below some reasoning steps to partially
compute an attribute folding considering the subset of fd’s in Σ with ϕ ∉ X.

1. x1 x2 x3 x5 υ → x4 [given]

2. x1 x3 x4 υ → x5 [given]

3. x5 υ → x7 [given]

4. x1 x3 x4 υ → x7 [R5 over (2), (3)]

5. x1 x2 x3 x5 υ → x7 [R5 over (1), (4)]

6. x1 x2 x3 x4 υ → x7 [R5 over (2), (5)].

Note that (6) is still amenable to further application of R5, say over (1), to
derive (7) x1 x2 x3 x5 υ → x7. However, even though (1) and (6) have (resp.) the
form (1) X → A and (6) Y → B with Y ⊉ X, we have Y → X as well, which
characterizes a cycle that fetches nothing new. In fact, if we consider only the
fd’s 1-3 given, then (6) itself satisfies Def. 11 and hence is folded. The same holds,
e.g., for (1) by an empty application of R5. □
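The derivation steps of the example can be replayed mechanically. The sketch below is only an illustration under assumed names: `attrs` and `r5` are hypothetical helpers, attribute υ is written `v`, and fd's are pairs of frozensets. It applies rule R5 exactly as stated in §4.1 and shows that step (7) reproduces step (5), i.e., the cycle.

```python
def attrs(names):
    """Hypothetical helper: attribute set from space-separated names."""
    return frozenset(names.split())


def r5(fd1, fd2):
    """Pseudo-transitivity (R5): from X -> Y and YZ -> W derive XZ -> W."""
    x, y = fd1
    yz, w = fd2
    if not y <= yz:
        raise ValueError("rhs of the first fd must occur in the lhs of the second")
    z = yz - y
    return (x | z, w)


f1 = (attrs("x1 x2 x3 x5 v"), attrs("x4"))   # (1) given
f2 = (attrs("x1 x3 x4 v"), attrs("x5"))      # (2) given
f3 = (attrs("x5 v"), attrs("x7"))            # (3) given
f4 = r5(f2, f3)                              # (4) R5 over (2), (3)
f5 = r5(f1, f4)                              # (5) R5 over (1), (4)
f6 = r5(f2, f5)                              # (6) R5 over (2), (5)
f7 = r5(f1, f6)                              # (7) R5 over (1), (6)

print(sorted(f6[0]), "->", sorted(f6[1]))
print("step (7) equals step (5):", f7 == f5)
```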


Lemma 2 Let S(ℰ, V) be a complete structure, φ a total causal mapping over S
and Σ an fd set encoded through φ given S. If ⟨X, A⟩ ∈ Σ, then A^↬, the attribute
folding of A (w.r.t. Σ), exists and is unique.


Proof 7 See Appendix, §A.2.1. □


Note that, in Example 4, at the step deriving (6) by R5 over (2), (5), such a cycle
was not yet formed.

We give an original algorithm (Alg. 5) to compute the folding of an fd set. At its
core there lies (Alg. 6) AFolding, which can be understood as a non-obvious variant
of XClosure (cf. Alg. 4) designed for acyclic pseudo-transitive reasoning. In order
to compute the folding of attribute A in fd ⟨X, A⟩ ∈ Σ, algorithm AFolding can be
seen as backtracing the causal ordering implicit in Σ towards A. Analogously, in
terms of the directed graph induced by the causal ordering (see Fig. 3.3), that
would comprise graph traversal to identify the nodes x ∈ V that have x_a ∈ V in
their reachability, i.e., x ⇝ x_a. Rather, AFolding’s processing of the causal ordering
is fully symbolic, based on Armstrong’s rewrite rule R5.


Example 5 Cyclicity in an fd set Σ may have the effect of making its folding
degenerate to Σ itself. For instance, consider Σ = {A → B, B → A}. Note that
Σ is canonical, and AFolding (w.r.t. Σ) is B given A, and A given B. That is,
Σ^↬ = Σ. □


Algorithm 5 Folding of an fd set.

 1: procedure Folding(Σ: fd set)
Require: Σ given encodes complete structure S
Ensure: Returns fd set Σ^↬, the folding of Σ
 2:   Σ^↬ ← ∅
 3:   for all ⟨X, A⟩ ∈ Σ do
 4:     Z ← AFolding(Σ, A)
 5:     Σ^↬ ← Σ^↬ ∪ ⟨Z, A⟩
 6:   return Σ^↬


Algorithm 6 Folding of an attribute w.r.t. an fd set.

 1: procedure AFolding(Σ: fd set, A: attribute)
Require: Σ is parsimonious
Ensure: Returns A^↬, the attribute folding of A (w.r.t. Σ)
 2:   Δ ← ∅    ▷ consumed fd’s
 3:   Λ ← ∅    ▷ consumed attrs.
 4:   A⋆ ← A    ▷ stores attrs. A is found to be ‘causally dependent’ on (cf. §4.3)
 5:   size ← 0
 6:   while size < |A⋆| do    ▷ halts when A^(i+1) = A^(i)
 7:     size ← |A⋆|
 8:     Σ ← Σ \ Δ
 9:     for all ⟨Y, B⟩ ∈ Σ do
10:       if B ∈ A⋆ then
11:         Δ ← Δ ∪ {⟨Y, B⟩}    ▷ consumes fd
12:         A⋆ ← A⋆ ∪ Y
13:         Λ ← Λ ∪ B    ▷ consumes attr.
14:         for all C ∈ Y do
15:           if C ∈ Λ and B ∈ X for ⟨X, C⟩ ∈ Δ then    ▷ cyclic fd
16:             Λ ← Λ \ B    ▷ reingests it to simulate cyclic app. of R5
17:   return A⋆ \ Λ
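The two procedures can be sketched in Python as below. This is a reading of the pseudocode under stated assumptions, not the author's implementation: fd's are pairs (lhs frozenset, single rhs attribute), the names `a_folding`, `folding`, `consumed_fds` and `consumed_attrs` are hypothetical, and fd's are scanned in list order, which matters when the input contains cycles. It uses the simple quadratic scan rather than the linear-time variant of Remark 4.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # (lhs attribute set, single rhs attribute)


def a_folding(sigma: Iterable[FD], a: str) -> FrozenSet[str]:
    """Attribute folding of a w.r.t. sigma, following Alg. 6 line by line."""
    pending: List[FD] = list(sigma)      # fd's not yet consumed
    consumed_fds: List[FD] = []          # consumed fd's
    consumed_attrs: Set[str] = set()     # consumed attributes
    star: Set[str] = {a}                 # A*: attrs a is found to depend on
    size = 0
    while size < len(star):
        size = len(star)
        for item in list(pending):
            lhs, rhs = item
            if rhs in star:
                pending.remove(item)     # consumes fd
                consumed_fds.append(item)
                star |= lhs
                consumed_attrs.add(rhs)  # consumes attr.
                for c in lhs:
                    # cyclic fd: c was consumed by an fd whose lhs contains rhs
                    if c in consumed_attrs and any(
                        rhs in z for z, c2 in consumed_fds if c2 == c
                    ):
                        consumed_attrs.discard(rhs)  # reingest rhs
    return frozenset(star - consumed_attrs)


def folding(sigma: Iterable[FD]) -> List[FD]:
    """Folding of an fd set (Alg. 5): fold the rhs of every fd."""
    sigma = list(sigma)
    return [(a_folding(sigma, a), a) for _, a in sigma]


# Fd's (1)-(3) of Example 4, with attribute upsilon written as "v".
example4 = [
    (frozenset("x1 x2 x3 x5 v".split()), "x4"),
    (frozenset("x1 x3 x4 v".split()), "x5"),
    (frozenset("x5 v".split()), "x7"),
]
print(sorted(a_folding(example4, "x7")))
```

On the cyclic set of Example 5, `folding` returns the input fd's unchanged, which matches the degenerate case described there.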


Theorem 3 Let S(ℰ, V) be a complete structure, and Σ an fd set encoded given
S. Now, let ⟨X, A⟩ ∈ Σ. Then AFolding(Σ, A) correctly computes A^↬, the attribute
folding of A (w.r.t. Σ), in time O(|S|²).


Proof 8 For the proof roadmap, note that AFolding is monotone (the size of A⋆ can
only increase) and terminates precisely when A^(i+1) = A^(i), where A^(i) denotes
the attributes in A⋆ at step i of the outer loop. The folding A^↬ of A at step i
is A^(i) \ Λ^(i). We shall prove by induction, given attribute A from fd X → A in
parsimonious Σ, that A⋆ \ Λ returned by AFolding(Σ, A) is the unique attribute
folding A^↬ of A. See Appendix, §A.2.2. □


Remark 4 Let Σ be an arbitrary fd set on attribute set U. Beeri and Bernstein
gave a straightforward optimization to (Alg. 4) XClosure to make it linear in |Σ| · |U|
(cf. [Ml p. 43-5]), where |Σ| · |U| is the maximum length for a string encoding all
the fd’s. Note that the actual length of such a string in our case is exactly |S|. The
optimization mentioned applies likewise to (Alg. 6) AFolding. That is, AFolding
can be implemented to run in linear time in |S|. □


Corollary 2 Let S(ℰ, V) be a complete structure, and Σ an fd set encoded given
S. Then algorithm Folding(Σ) correctly computes the folding of Σ in time that
is f(|S|) Θ(|ℰ|), where f(|S|) is the time complexity of (Alg. 6) AFolding.


Proof 9 See Appendix, §A.2.3 


□ 


Finally, it shall be convenient to come up with a notion of parsimonious fd sets
(see Def. 13), which is suggestive of a distinguishing feature of such mathematical
information systems in comparison with arbitrary information systems.


Def. 13 Let Σ be a set of fd’s on attributes U. Then, we say that Σ is parsimonious
if it is canonical and, for all fd’s ⟨X, A⟩ ∈ Σ with XA ⊆ U, there is no
Y ⊆ U such that Y ≠ X and ⟨Y, A⟩ ∈ Σ.
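The second condition of Def. 13 says that no attribute is determined by two different left-hand sides. A minimal sketch of that check follows; it assumes fd's given as (lhs frozenset, rhs attribute) pairs, the function name `has_unique_rhs` is hypothetical, and canonicity is deliberately not checked here.

```python
from collections import Counter


def has_unique_rhs(sigma):
    """True if no attribute occurs as the rhs of two different fd's.

    Covers only the uniqueness condition of Def. 13; whether sigma is
    canonical must be established separately.
    """
    counts = Counter(rhs for _, rhs in set(sigma))
    return all(n == 1 for n in counts.values())


ok = [(frozenset({"A"}), "B"), (frozenset({"B"}), "C")]
bad = ok + [(frozenset({"D"}), "C")]   # C now has two determinants
print(has_unique_rhs(ok), has_unique_rhs(bad))
```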

Proposition 3 shall then be useful further on, in connection with the concept of
the folding.


Proposition 3 Let S(ℰ, V) be a complete structure, φ a total causal mapping
over S and Σ an fd set encoded through φ given S. Let Σ^↬ be the folding of Σ;
then Σ^↬ is parsimonious.


Proof 10 See Appendix, §A.2.4 


□ 


4.3. Equivalence with Causal Ordering 

Now we show the equivalence of acyclic pseudo-transitive reasoning with causal
ordering processing. We start with Theorem 4, which establishes the equivalence
between the notion of causal dependency and the fd encoding scheme presented in
Chapter 3.

We omit its tedious exposition here. In short, it shall require one more auxiliary data structure
to keep track, for each fd not yet consumed, of how many attributes not yet consumed appear in
its rhs.

Theorem 4 Let S(ℰ, V) be a complete structure, φ a total causal mapping over
S and Σ an fd set encoded through φ given S. Then x_a, x_b ∈ V are such that
x_b is causally dependent on x_a, i.e., (x_a, x_b) ∈ C_φ⁺, iff there is some non-trivial fd
⟨X, B⟩ ∈ Σ^▷ with A ∈ X, where B ↦ x_b and A ↦ x_a.


Proof 11 We prove the statement by induction. We consider first the ‘if’ direction,
and then its ‘only if’ converse. See Appendix §A.2.5.


□ 


Def. 14 then gives useful terminology for a neat concept towards our goal in
this chapter.


Def. 14 Let S(ℰ, V) be a structure with variables x_a, x_b ∈ V, and φ a total causal
mapping over S inducing a set C_φ of direct causal dependencies and its transitive
closure C_φ⁺. We say that x_a is a first cause of x_b in S if (x_a, x_b) ∈ C_φ⁺ and, for no
x ∈ V, we have (x, x_a) ∈ C_φ⁺.
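Def. 14 can be illustrated on a small graph. The sketch below is a naive illustration, assuming direct causal dependencies are given as (cause, effect) pairs; the function names and the example dependencies (a predator-prey style cycle between x and y) are hypothetical. It computes the transitive closure explicitly, which is precisely what folding is designed to avoid.

```python
def transitive_closure(direct):
    """Transitive closure of a set of (cause, effect) pairs (naive fixpoint)."""
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def first_causes(direct, xb):
    """Variables that xb depends on and that depend on nothing themselves."""
    closure = transitive_closure(direct)
    has_cause = {effect for _, effect in closure}
    return {a for a, b in closure if b == xb and a not in has_cause}


direct = {
    ("x0", "x"), ("b", "x"), ("p", "x"), ("y", "x"),
    ("y0", "y"), ("d", "y"), ("r", "y"), ("x", "y"),
}
print(sorted(first_causes(direct, "x")))
```

Note that x and y depend on each other here, so neither is a first cause of the other; only the variables without any incoming dependency qualify.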

Proposition 4 connects the notion of first cause with those of exogenous and
endogenous variables introduced earlier.

Proposition 4 Let S(ℰ, V) be a structure with variable x ∈ V. Then x can only
be a first cause of some y ∈ V if x is exogenous. Accordingly, any variable y ∈ V
can only have some first cause x ∈ V if it is endogenous.


Proof 12 Straightforward from definitions, see Appendix §A.2.6.


□ 


Note that exogenous variables are encoded into fd’s X → A with ϕ ∈ X.
Since the values of such variables are assigned “outside” the system (cf. the earlier
remark on exogenous variables), they are devoid of indirect causal dependencies and
then have no uncertainty except for their own. Thus, we need not be concerned at all
with processing the causal (uncertainty) chaining towards them. Our goal is rather to
find the first causes of the endogenous variables (predictive attributes).

We shall then need the terminology of Def. 15, after which we introduce Lemma 3,
paving the way to our goal.


Def. 15 Let S be a structure, and Σ be a set of fd’s encoded over it. Then Υ(Σ),
the υ-projection of Σ, is the subset of fd’s X → A such that υ ∈ X. Accordingly,
Φ(Σ), the ϕ-projection of Σ, is the subset of fd’s X → A such that υ ∉ X.

Fig. 4.2 illustrates the υ-projection of an fd set and the folding applied over
such fd subset in order to compute the first causes of endogenous variables.


Lemma 3 Let S(ℰ, V) be a complete structure, φ a total causal mapping over S
and Σ an fd set encoded through φ given S. Then a variable x_a ∈ V can only be
a first cause of some variable x_b ∈ V, where ⟨X, B⟩ ∈ Σ, and B ↦ x_b, A ↦ x_a, if
either (i) A ∈ X or (ii) A ∉ X but there is ⟨Z, C⟩ ∈ Σ^▷ with A ∈ Z and C ∈ X.


Proof 13 We prove the statement by construction out of Theorem 4; see Appendix.
□


Finally, Theorem 5 further clarifies the purpose of the folding and its meaning
in terms of causal ordering.


Theorem 5 Let S(ℰ, V) be a complete structure, φ a total causal mapping over
S and Σ an fd set encoded through φ given S. Now, let B be an attribute that
encodes some variable x_b ∈ V. If ⟨X, B⟩ ∈ Υ(Σ)^↬, then every first cause x_a of x_b
(if any) is encoded by some attribute A ∈ X.


Proof 14 We show that the existence of a missing first cause x_c of x_b for folded
X ↬ B, where B ↦ x_b and C ↦ x_c but C ∉ X, leads to a contradiction. See
Appendix §A.2.8. □

Note that the folding is taken w.r.t. the υ-projection of Σ; then x_b, where B ↦ x_b, is an
endogenous variable.

Remark 5 Observe that, on the one hand, the goal of computing the transitive
closure C_φ⁺ of a set C_φ of induced causal dependencies is to derive the entire causal
ordering of a given structure. The goal of folding, on the other hand, is not to
discover all variables (attributes) a given variable (attribute) is causally dependent
on, but only all of its first causes (if any). □

In particular, the results just shown comprise a method to compute, for each
endogenous variable (predictive attribute), all of its first causes. This is a core goal
of the reasoning device developed in this chapter in order to enable the automatic
synthesis of hypotheses as uncertain and probabilistic (U-relational) data.


4.4. Experiments 


Fig. 4.3 shows the results of experiments we have carried out in order to study
how effective the causal reasoning over fd’s is in practice, in particular its behavior
for hypotheses whose structure S has been randomly generated over orders of
magnitude |S| ≈ 2^k, to have length up to |S| ≈ 2^20 ≲ 1M. The largest structure
considered, with |S| ≈ 1M, has been generated to have exactly |ℰ| = 2.5K. Like
in the experiments of the previous chapter, we have executed ten runs for each
tested order of magnitude, and then taken its mean running time in ms.


The plot shown in Fig. 4.3 is in logscale base 2. Notice the linear rate
of growth across orders of magnitude (base 2) from 1K to 1M sized structures.
For a growth factor of 2 in structure length (doubled), the time required by causal
reasoning grows by a factor of 2 (doubled as well). These scalability results are
compatible with the computational complexity of folding, which is bounded by O(|S|²).
Yet, that is a somewhat overestimated time bound, as we see in the plot of Fig. 4.3.
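The reading of a log-log plot used above can be spelled out in one line. As a hedged illustration, assume the running time follows a power law T(|S|) = c |S|^α for some constants c and α (an assumption made only for this sketch, not a fitted model):

```latex
\begin{align*}
  \log_2 T(|S|) &= \alpha \log_2 |S| + \log_2 c, \\
  \frac{T(2|S|)}{T(|S|)} &= 2^{\alpha}.
\end{align*}
```

A straight line in the plot thus corresponds to a power law, and an observed doubling of time for a doubling of |S| corresponds to α ≈ 1, which is below the exponent α = 2 allowed by the O(|S|²) bound.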


4.5. Related Work 

The concept of fd set folding, and the design of (Alg. 6) AFolding as a not quite
obvious variant of XClosure, is an original approach to the problem of processing
the causal ordering of a hypothesis via acyclic pseudo-transitive reasoning over
fd’s. To the best of our knowledge, such a specific form of fd reasoning was a
yet unexplored problem in the database research literature (reasoning over fd’s is
extensively covered in Maier [22]).

The experiments were performed on a 2.3 GHz/4 GB Intel Core i5 running Mac OS X 10.6.8.


Figure 4.3. Performance of acyclic causal reasoning over fd’s (logscale);
horizontal axis: structure length |S|.

Recent years have seen the emergence of some foundational work in causality
in databases [6l]. It is motivated by improving DB usability in terms of providing
users with explanations to query answers (and non-answers). Essentially, the idea
is borrowed from AI work on causality (cf. §3.6) to identify causal ordering between
tuples. Given a query and its result set, the system should be able to explain to the
user what tuples ‘caused’ that answer, or why possibly expected tuples are missing.
That requires computing the causal chain of tuples for a given query, which can be
computationally expensive as the database instance can be very large. For conjunctive
queries, the causality is said to be computed very efficiently [65].

A more specific problem addressed by Kanagal et al. is the so-called sensitivity
analysis [66], which is aimed at establishing a more refined connection between
the query answer (output) and elements of the DB instance (input) for supporting
user interventions. Instead of providing the user with causes and non-causes, the
goal is to enable the user to know how changes in the input affect the output. This
line of work is strongly related to the vision of ‘reverse data management ’ Ea. 

Causal reasoning in the presence of constraints (viz., fd’s) is a yet unexplored
topic, though called for as worthy of future work by Meliou et al. [Ml p. 3].
The fd’s are rich information that can be exploited for the sake of explanation and
sensitivity analysis. Once they are available, it is intuitive that the search space of
such problems shall be significantly reduced.

In fact, our encoding of equations into fd’s captures the causal chain from
exogenous (input) to endogenous (output) tuples at schema level. Nonetheless,
our form of causal reasoning over fd’s is geared for hypothesis management and
analytics, from an uncertainty management point of view. A concrete connection
to causality in DB’s is not yet established.


4.6. Summary of Results 

In this chapter we have studied and developed a technique for acyclic causal 
reasoning over fd’s. We list the results achieved as follows. 

• We have developed principled concepts and a core algorithm, viz., (Alg. 5)
the folding, in order to perform acyclic pseudo-transitive reasoning over
fd’s. This is a step towards a method for causal reasoning that is efficient, yet
elegant as a database formalism for the systematic construction of hypothesis
probabilistic DB’s.

• We have given a reasonably tight time bound for the behavior of such
reasoning device in terms of the structure given as input. We have
established (cf. Theorem 3, Corollary 2) the time bound of O(|S|²) for the
folding algorithm.

• We have shown the correctness of the folding algorithm in connection with
causal reasoning (cf. the theorems of this chapter).

• We have defined the core notion of first causes (cf. Def. 14, Proposition 4),
which is meant to guide the procedure of U-intro (Chapter 5) by
precisely capturing the uncertainty factors on endogenous variables
(predictive attributes). This is similar to, yet markedly different from, computing
the transitive closure of causal dependencies (cf. Remark 5).

• We have performed experiments (cf. Fig. 4.3) to study how effective the
approach of causal reasoning over fd’s is in practice, or how it scales for
hypotheses whose structure S is randomly generated to have length up
to the order of |S| ≲ 1M. The experiments show that the time bound,
though already effective for very large structures, is somewhat overestimated.




Chapter 5 


Probabilistic Database Synthesis 


In this chapter we present a technique to synthesize hypothesis U-relations. At
this stage of the pipeline, relational schema H is loaded with datasets computed
from the hypotheses under alternative trials (input settings). The challenge is how
to model or design its probabilistic version (i.e., render the U-relations Y) so that
it is suitable for data-driven hypothesis management and analytics.

In §5.1 we introduce U-relational DB’s. In §5.2 we present a running example
to illustrate the uncertainty introduction procedure (U-intro in the pipeline, cf. Fig.
1.4). In §5.3 we present the technique to factorize the uncertainty present in the
‘big’ fact table in terms of the well-defined uncertainty factors. In §5.4 we then
show how to propagate such uncertainty into the predictive attributes properly,
i.e., based on their first causes detected as shown in Chapter 4. In §5.6 we discuss
related work. Finally, we conclude the chapter.


5.1. Preliminaries: U-Relations and Probabilistic WSA 

Three remarkable features of U-relations are: expressiveness (being closed under
positive relational algebra queries); succinctness (efficient storage of a very large
number of possible worlds through vertical decompositions to support
attribute-level uncertainty); and efficient query processing (including confidence
computation) [18].

A U-relational database or U-database is a finite set of structures,

\[ W = \{ \langle R_1^1, \ldots, R_m^1, p^{[1]} \rangle, \; \ldots, \; \langle R_1^n, \ldots, R_m^n, p^{[n]} \rangle \}, \]

of relations R_1^i, ..., R_m^i and numbers 0 < p^[i] ≤ 1 such that ∑_{1≤i≤n} p^[i] = 1. Each
element ⟨R_1^i, ..., R_m^i, p^[i]⟩ ∈ W is a possible world, with p^[i] being its probability [18].

We postpone the presentation of experiments on p-DB synthesis (U-intro as a whole) to §6.4.

Probabilistic world-set algebra (p-WSA) consists of the operations of relational
algebra, an operation for computing tuple confidence conf, and the repair-key
operation for introducing uncertainty, by giving rise to alternative worlds as
maximal-subset repairs of an argument key [18].

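Let R_ℓ[U] be a relation, and XA ⊆ U. For each possible world ⟨R_1, ..., R_ℓ, ..., R_m,
p⟩ ∈ W, let A ∈ U contain only numerical values greater than zero and let R_ℓ
satisfy the fd (U \ A) → U. Then, repair-key is:

\[ [\![ \text{repair-key}_{X@A}(R_\ell) ]\!](W) := \{ \langle R_1, \ldots, R_\ell, \ldots, R_m, \; \hat{R}_\ell[U \setminus A], \; \hat{p} \rangle \}, \]

where ⟨R_1, ..., R_ℓ, ..., R_m, p⟩ ∈ W, R̂_ℓ is a maximal repair of fd X → U in R_ℓ, and

\[ \hat{p} = p \cdot \prod_{t \in \hat{R}_\ell} \frac{t.A}{\sum_{s \in R_\ell :\, s.X = t.X} s.A}. \]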
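The effect of repair-key on a single certain relation can be sketched as follows. This is a hedged illustration of the definition above and not how a U-relational engine evaluates it (worlds are enumerated explicitly, which is exactly what U-relations avoid). The function name `repair_key`, the attribute names and the weights in `h0` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def repair_key(rows, key, weight):
    """Enumerate the worlds produced by repairing `key` in one certain relation.

    rows: list of dicts; key: tuple of attribute names; weight: name of an
    attribute holding positive numbers. Each world keeps exactly one tuple per
    key value and drops the weight column; its probability is the product of
    the normalized weights of the tuples it keeps.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    worlds = []
    for choice in product(*groups.values()):
        prob = 1.0
        for t in choice:
            total = sum(s[weight] for s in groups[tuple(t[k] for k in key)])
            prob *= t[weight] / total
        world = [{a: v for a, v in t.items() if a != weight} for t in choice]
        worlds.append((world, prob))
    return worlds


# Hypothetical explanation table: one phenomenon, three competing hypotheses.
h0 = [
    {"phi": 1, "upsilon": 1, "conf": 2},
    {"phi": 1, "upsilon": 2, "conf": 1},
    {"phi": 1, "upsilon": 3, "conf": 1},
]
for world, prob in repair_key(h0, ("phi",), "conf"):
    print(world, prob)
```

With a single key value and three alternatives, the call yields three worlds whose probabilities are the weights normalized to sum to one.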


U-relations (cf. Fig. 5.1) have in their schema a set of pairs (V_i, D_i) of condition
columns (cf. [18]) to map each discrete random variable x_i to one of its possible
values (e.g., x_1 ↦ 1). The world table W stores their marginal probabilities (cf.
the notion of pc-tables [m Ch. 2]). For an illustration of the data transformation
from certain to uncertain relations, consider query (5.1) in p-WSA’s extension of
relational algebra, whose result set is materialized into U-relation Y_0 as shown in
(Fig. 5.1).

\[ Y_0 := \pi_{\phi, \upsilon}(\text{repair-key}_{\phi @ \mathit{Conf}}(H_0)). \qquad (5.1) \]


Also, let R[V_i D_i | sch(R)], S[V_j D_j | sch(S)] be two U-relations, where R.V̄D̄ is
the union of all pairs of condition columns V_i D_i in R; then operations of selection
[[σ_ψ(R)]], projection [[π_Z(R)]], and product [[R × S]] issued in relational algebra
are rewritten in positive relational algebra on U-relations:

\begin{align*}
  [\![ \sigma_\psi(R) ]\!] &:= \sigma_\psi(R[V_i D_i \mid sch(R)]); \\
  [\![ \pi_Z(R) ]\!] &:= \pi_{\bar{V}\bar{D}\, Z}(R); \\
  [\![ R \times S ]\!] &:= \pi_{(R.\bar{V}\bar{D} \,\cup\, S.\bar{V}\bar{D}) \to \bar{V}\bar{D} \,\cup\, sch(R) \,\cup\, sch(S)}
      \big( R \bowtie_{R.\bar{V}\bar{D} \text{ is consistent with } S.\bar{V}\bar{D}} S \big).
\end{align*}

Figure 5.1. U-relation generated by the repair-key operation.


If R and S have k and ℓ pairs of condition columns each, then [[R × S]]
returns a U-relation with k + ℓ such pairs. If k = 0 or ℓ = 0 (or both), then R or
S (or both) are classical relations, but the rewrite rules above apply accordingly.
All that rewriting is parsimonious translation (sic. [H]): the number of algebraic
operations does not increase and each of the operations selection, projection and
product/join remains of the same kind. Query plans are hardly more complicated
than the input queries. In fact, it has been verified that off-the-shelf relational
database query optimizers do well in practice.
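The consistency condition in the product rewrite can be made concrete with a small sketch. It is an illustration only: a U-relation is modeled as a list of (conditions, tuple) pairs, where conditions map a random variable to the value it must take, and the names `consistent` and `u_product` as well as the sample data are hypothetical. Value attribute names of the two inputs are assumed to be disjoint.

```python
def consistent(cond_r, cond_s):
    """Two condition sets are consistent if no variable gets two values."""
    return all(cond_s.get(var, val) == val for var, val in cond_r.items())


def u_product(r, s):
    """Product of two U-relations given as lists of (conditions, tuple) pairs.

    Keeps only combinations with consistent conditions and merges both the
    condition columns and the value columns of the paired tuples.
    """
    out = []
    for cond_r, t_r in r:
        for cond_s, t_s in s:
            if consistent(cond_r, cond_s):
                out.append(({**cond_r, **cond_s}, {**t_r, **t_s}))
    return out


r = [({"x1": 1}, {"b": 10}), ({"x1": 2}, {"b": 12})]
s = [({"x1": 1}, {"d": 3}), ({"x2": 1}, {"r": 5})]
for conds, values in u_product(r, s):
    print(conds, values)
```

The pair that assigns x1 ↦ 2 on one side and x1 ↦ 1 on the other is dropped, since no possible world satisfies both conditions.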

For a comprehensive overview of U-relations and p-WSA we refer the reader 
to [18]. In this thesis we look at U-relations from the point of view of p-DB design, 
for which no methodology has yet been proposed. We are concerned in particular 
with hypothesis management applications [TO] . 


5.2. Running Example 

Before proceeding, we consider Example 6, which is fairly representative
to illustrate how to deal with correlations in the predictive data of deterministic
hypotheses for the sake of suitable data-driven analytics.


Example 6 We explore three slightly different theoretical models in population
dynamics with applications in Ecology, Epidemics, Economics, etc.: (5.2) Malthus’
model, (5.3) the logistic equation and (5.4) the Lotka-Volterra model. In practice,
such equations are meant to be extracted from MathML-compliant XML files.
For now, consider that the ordinary differential equation notation ‘ẋ’
is read ‘variable x is a function of time t given initial condition x_0.’

\begin{align*}
  \dot{x} &= b\,x                 \tag{5.2} \\
  \dot{x} &= b\,(1 - x/K)\,x      \tag{5.3} \\
  \dot{x} &= x\,(b - p\,y), \qquad
  \dot{y}  = y\,(r\,x - d)        \tag{5.4}
\end{align*}

The models are completed (by the user) with additional equations to provide the
values of exogenous variables (or ‘input parameters’), e.g., x_0 = 200, b = 10, such
that we have SEM’s (resp.) S_k(ℰ_k, V_k) for k = 1..3,

• S_1 = { f_1(t), f_2(x_0), f_3(b), f_4(x, t, x_0, b) };

• S_2 = { f_1(t), f_2(x_0), f_3(K), f_4(b), f_5(x, t, x_0, K, b) };

• S_3 = { f_1(t), f_2(x_0), f_3(b), f_4(p), f_5(y_0), f_6(d), f_7(r),
f_8(x, t, x_0, b, p, y), f_9(y, t, y_0, d, r, x) }.

Fig. 5.2 shows the fd sets encoded from structures S_k given above. We also
consider trial datasets for hypothesis υ = 3 (viz., the Lotka-Volterra model), which
are loaded into the ‘big’ fact table relation H_3 ∈ H as shown in Fig. 5.3. We admit
a special attribute ‘trial id’ tid to keep hypothesis trials identified until uncertainty
is introduced in a controlled way by p-DB synthesis (U-intro stage, cf. Fig. 1.4).
□


Given S(ℰ, V), it is actually a task of the encoding algorithm (viz., sub-procedure TCM) to
infer, for each variable x ∈ V, whether it is exogenous or endogenous by means of the total causal
mapping.

Recall that domain variables like time t are informed to the encoding algorithm to suppress
the fd for t.

Figure 5.2. Fd sets encoded from the given structures S_k(ℰ_k, V_k) for hypotheses
k = 1..3 from Example 6.

Figure 5.3. ‘Big’ fact table H_3 of hypothesis k = 3 from Example 6, loaded with
trial datasets identified by special attribute tid.


Given the ‘big’ fact table, p-DB synthesis has two main parts: process
the ‘empirical’ uncertainty present in the ‘big’ fact table and synthesize it out
(decompose it) into independent u-factors (u-factorization); and then propagate it
precisely into the predictive data (u-propagation).

5.3. U-Factorization

As we have seen in §5.1, the repair-key operation allows one to create a discrete
random variable in order to repair an argument key in a given relation. Our goal
here is to devise a technique to perform such operation in a principled way for
hypothesis management. It is a basic design principle to have exactly one random
variable for each distinct uncertainty factor (‘u-factor’ for short), which requires
carefully identifying the actual sources of uncertainty present in relations H.

The multiplicity of (competing) hypotheses is itself a standard one, viz., the
theoretical u-factor. Consider an ‘explanation’ table like H_0 in Fig. 5.1, which
stores (as foreign keys) all hypotheses available and their target phenomena. We
can take such H_0 as explanation table for the three hypotheses of Example 6. Then
a discrete random variable V_0 is defined into Y_0[V_0 D_0 | ϕ υ] by query formula (5.1).
U-relation Y_0 is considered standard in p-DB synthesis, as the repair of ϕ as a key
in (standard) H_0.

Hypotheses, nonetheless, are (abstract) ‘universal statements’ [15]. In order
to produce a (concrete) valuation over their endogenous attributes (predictions),
one has to inquire into some particular ‘situated’ phenomenon ϕ and tentatively
assign a valuation over the exogenous attributes, which can be eventually tuned
for a target ϕ. The multiplicity of such (competing) empirical estimations for a
hypothesis k leads to Problem 1, viz., learning empirical u-factors for each H_k ∈ H.


Problem 1 Let Σ_k be an fd set encoding a given hypothesis structure S_k, and H_k its
‘big’ fact table relation loaded with trial data. Now, let Z be the set of attributes
encoding exogenous variables in H_k; then the problem of u-factor learning is:

(1) to infer in H_k ‘casual’ fd’s φ B_i → B_j, φ B_j → B_i not in Σ_k (strong input
correlations), where B_i, B_j ∈ Z;

(2) to form maximal groups G_1, ..., G_n ⊆ Z of attributes such that for all
B_i, B_j ∈ G_a, the casual fd’s φ B_i → B_j and φ B_j → B_i hold in H_k;

(3) to pick, for each group G_a, any A ∈ G_a as a pivot representative and
insert φ A → B into an fd set Ω_k for all B ∈ (G_a \ A).
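The three steps of Problem 1 can be sketched over a small in-memory table. This is only an illustration under stated assumptions: the helper names (`holds_fd`, `learn_u_factors`), the list-of-dicts table layout and the attribute name `phi` are hypothetical, and the naive fd test below is not the pruned-lattice algorithm the thesis uses.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the fd lhs -> rhs holds in rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def learn_u_factors(rows, exogenous, phi="phi"):
    """Steps (1)-(3) of Problem 1 over a small table (hypothetical helper)."""
    groups = []
    for attr in exogenous:
        for group in groups:
            pivot = group[0]
            # Mutual 'casual' fd's: phi pivot -> attr and phi attr -> pivot.
            if (holds_fd(rows, [phi, pivot], [attr])
                    and holds_fd(rows, [phi, attr], [pivot])):
                group.append(attr)
                break
        else:
            groups.append([attr])
    # Step (3): the first attribute of each group acts as the pivot.
    pivot_fds = [((phi, g[0]), b) for g in groups for b in g[1:]]
    return groups, pivot_fds
```

Comparing each attribute only with the pivot of a group is enough here, because mutual determination given φ is transitive. Attributes that stay constant across all trials end up grouped together, which matches the reading of such a group as a certain factor.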
Figure 5.4. ‘Big’ fact table of hypothesis k = 3 from the running example, with
u-factors {b, d} and {p, r} emphasized (resp.) in colors green and red.


U-factor learning is meant to process only the attributes Z ⊆ U from H_k[U] that
are inferred exogenous in the given hypothesis, i.e., for all A ∈ Z, there is an fd
⟨X, A⟩ ∈ Φ(Σ_k), where the latter is the φ-projection of Σ_k. Such attributes are
then ‘officially’ unrelated. In fact, by ‘casual’ fd’s we mean correlations that, for a
set of experimental trials, may occasionally show up in the trial input data; e.g.,
x_0 ↔ y_0 may hold in H_k, but not because x_0 and y_0 are related in principle (theory).

Fig. 5.4 helps to illustrate Problem 1 through the ‘big’ fact table. We
emphasize u-factors {b, d} and {p, r} in colors green and red. Observe that values
of b are strongly correlated (one-to-one) with values of d for φ = 1, just like p and
r. Note also that {x_0, y_0} can be seen as a certain factor. From the user point
of view, this is a record that reflects a common practice in computational science
known as (parameter) sensitivity analysis.

Problem 1 is dominated by the (problem of) discovery of fd’s in a relation,
which is not really a new problem (e.g., see [68]). We then keep focus on the
synthesis method as a whole and omit our detailed u-factor-learning algorithm in
particular. Its output, fd set Ω_k, is then filled in (completed) with the
υ-projection Υ(Σ_k). For illustration consider hypothesis υ = 3 and its trial input
data recorded in H_3 in Fig. 5.3. We show its resulting fd set Ω_3 in Fig. 5.5 (left),
together with its folding (right). The latter is then input to (Alg. 7) merge
to get the final information necessary for the actual synthesis of U-relations, as
captured in Def. 16. For an illustration of the merging of fd’s with equivalent
left-hand sides, note in Fig. 5.5 (right) that φ x_0 b p t υ y ↔ φ x_0 b p t υ x holds.

Figure 5.5. Fd set Ω_3 (compare with Σ_3) and its folding.




Def. 16 Let S_k and H_k be the complete structure and ‘big’ fact table of hypothesis
k, and Σ_k an fd set defined Σ_k = h-encode(S_k). Now, let Ω_k = u-factor-learning(H_k,
Φ(Σ_k)) ∪ Υ(Σ_k), and let folding(Ω_k) be the folding of Ω_k. Finally, define
Γ_k = merge(folding(Ω_k)). Then we say that Γ_k is the u-factorization of S_k over H_k.


Algorithm 7 Merge fd’s with equivalent left-hand sides.

procedure MERGE(Σ : fd set)
    Ω ← ∅
    for all ⟨X, C⟩ ∈ Σ do
        if there is ⟨Z, W⟩ ∈ Ω such that X ↔ Z holds in Σ^+ then
            Ω ← Ω \ ⟨Z, W⟩
            S ← X \ Z
            Ω ← Ω ∪ ⟨Z, W S C⟩        ▷ merges equivalent keys
        else
            Ω ← Ω ∪ ⟨X, C⟩
    return Ω
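A runnable reading of Algorithm 7 may help to see what the merge does. The sketch below assumes fd’s are represented as pairs of frozensets and tests the equivalence X ↔ Z through attribute closures; the function names `closure`, `equivalent` and `merge` are illustrative and not taken from the thesis prototype.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def equivalent(x, z, fds):
    """X <-> Z holds in the closure of fds iff each side determines the other."""
    return z <= closure(x, fds) and x <= closure(z, fds)


def merge(fds):
    """Merge fd's whose left-hand sides are equivalent (after Algorithm 7)."""
    merged = []  # entries [lhs, rhs] with pairwise non-equivalent lhs
    for x, c in fds:
        for entry in merged:
            z, w = entry
            if equivalent(x, z, fds):
                # Replace <Z, W> by <Z, W S C>, where S = X \ Z.
                entry[1] = w | (x - z) | c
                break
        else:
            merged.append([x, c])
    return [(z, w) for z, w in merged]
```

Followed literally, the algorithm can leave an attribute of Z on the right-hand side when two claims determine each other (e.g., merging ⟨KY, X⟩ with ⟨KX, Y⟩ yields ⟨KY, XY⟩); the sketch keeps that behavior rather than trimming it.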


^ In short, the u-factor-learning algorithm makes use of the relational algebra group-by
operation and builds a pruned lattice of attribute groups having the same number of rows
under the grouping (similarly to [68]).

















Remark 6 Let Γ_k be the u-factorization of structure S_k over ‘big’ fact table H_k.
Then every fd in Γ_k encodes a clear-cut claim, either empirical, in Φ(Γ_k), or
theoretical, in Υ(Γ_k). That is ensured by the merge algorithm, which groups fd’s
with equivalent left-hand sides. □


We are now able to employ a notion of u-factor decomposition formulated in
Def. 17 into query formula (5.5) in p-WSA’s extension of relational algebra.

Def. 17 Let S_k be the complete structure of hypothesis k, and H_k[U] its ‘big’ fact
table such that Γ_k is the u-factorization of S_k over H_k. Now, let G_a ⊆ U be a set
of attributes G_a = AG such that, for all B ∈ G, an fd φ A → B exists in Φ(Γ_k).
Then we define U-relation Y_k^i[V_i D_i | φ A G] by query formula (5.5), and say that
Y_k^i is a u-factor projection of H_k.

$$Y_k^i := \pi_{\phi A G}\bigl(\text{repair-key}_{\phi @ \mathit{count}}\bigl(\gamma_{\phi, A, G, \mathit{count}(*)}(H_k)\bigr)\bigr) \qquad (5.5)$$

where γ is relational algebra’s grouping operator.
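To make formula (5.5) concrete, the following sketch emulates its three steps in plain Python: grouping with count(*), normalizing the counts per φ (the role played by repair-key weighted by count), and projecting onto φ, A, G. The function name `u_factor_projection`, the output column names `D` and `prob`, and the table layout are assumptions made for illustration; an actual deployment would run the query in MayBMS instead.

```python
from collections import Counter


def u_factor_projection(rows, phi, attrs):
    """Emulate pi_{phi,A,G}(repair-key_{phi@count}(gamma_{phi,A,G,count(*)}(H_k)))."""
    cols = [phi, *attrs]
    # gamma: group by (phi, A, G) and count the rows in each group.
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    totals = Counter()
    for key, n in counts.items():
        totals[key[0]] += n
    # repair-key on phi: one alternative per group, weighted by its count.
    alternatives = []
    next_id = Counter()
    for key in sorted(counts):
        next_id[key[0]] += 1
        alternatives.append({
            **dict(zip(cols, key)),
            "D": next_id[key[0]],                    # alternative index within phi
            "prob": counts[key] / totals[key[0]],    # normalized weight
        })
    return alternatives
```

Each returned row corresponds to one value assignment of the random variable introduced for the u-factor, with probabilities that sum to one for each φ.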
Figure 5.6. U-factor projections rendered for hypothesis υ = 3.


The synthesis of u-factor projections, in particular the application of repair-key (cf.
Eq. 5.5), has an important consequence for the u-factorization Γ_k of H_k, viz., the
introduction of new fd’s into Γ'_k defined as follows (see Def. 18). We shall consider
it (rather than Γ_k) to study design-theoretic properties of the synthesized schema in §5.5.


Def. 18 Let S_k and H_k[U] be (resp.) the complete structure and ‘big’ fact table
of hypothesis k, and Γ_k be the u-factorization of S_k over H_k. Now, let Γ'_k =
{ … A_i G_i } ∪ Γ_k, where I indexes all u-factor projections Y_k^i[V_i D_i | φ A_i G_i]
of H_k. We say that Γ'_k is the repaired factorization of S_k over H_k.
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5.4. U-Propagation 

U-propagation is a central part of U-intro and the pipeline itself. Recall that all
the machinery developed so far, from hypothesis encoding to causal reasoning to
u-factorization, is for enabling predictive analytics. Let us briefly reconstruct it.

For hypothesis structure S_k(E, V), take any endogenous variable x_c ∈ V encoded
by attribute C ↦ x_c. There should be exactly one fd ⟨X, C⟩ ∈ Υ(Σ_k).
By a theorem established earlier, for every first cause x_b of x_c there is B ∈ X where
B ↦ x_b with x_b ∈ V. Now, observe that when Ω_k is rendered by u-factor learning, it is
filled partly with fd’s from Υ(Σ_k), and partly with fd’s processed from Φ(Σ_k). This is
to summarize exogenous variables into clear-cut independent u-factors. It means
that, after u-factor learning, each first cause x_b encoded by B shall be represented
by some pivot attribute A_i which is, if not A_i = B itself, then occasionally strongly
correlated to it (i.e., φ A_i → B and φ B → A_i hold in H_k).

Further, Ω_k is then subject to folding such that, if B ∈ X for ⟨X, C⟩ ∈ Υ(Ω_k),
now we have A_i ∈ Z for ⟨Z, C⟩ ∈ Υ(folding(Ω_k)). This processing from fd X → C
(where X contains the first causes) into Z → C (where Z contains only their pivot
representatives) is meant for enabling an economical representation of uncertainty.
Our running example is small, but such a principle is quite relevant for large-scale
hypotheses (say, when |S_k| ~ 1M). The correctness of such u-factor summarization
is ensured by an earlier proposition, which lets us know that the folding of Ω_k is
parsimonious, hence (by Def. 13) canonical, and therefore left-reduced.

Now, all fd’s in Υ(folding(Ω_k)) have form Z → C and we are almost ready for
u-propagation. Note that, as a result of u-factorization, each pivot attribute A_i ∈ Z
is associated with random variable V_i from U-relation Y_k^i[V_i D_i | φ A_i G_i]. Then
we shall use each A_i ∈ Z (from Z → C) as a surrogate to Y_k^i in order to propagate
factorized uncertainty into ‘predictive’ U-relations Y_k^j[V_j D_j | S T] by a join
formula. Attribute sets S and T are defined after merging fd’s in folding(Ω_k) with
equivalent lhs to get Γ_k and pass it as argument for synthesis. We let S = Z \ W such
that S contains the domain variables only (e.g., φ, υ, t). The pivot attributes in Z
shall not be included in the data columns of Y_k^j but leave their trace through the
condition columns V_j D_j that annotate sch(Y_k^j) as a repair of the key S → T.
All that (cf. Def. 19) is abstracted into general p-WSA query formula (5.6),
and employed in (Alg. 8) synthesize to accomplish u-propagation (Part II).

Def. 19 Let S_k be the complete structure of hypothesis k, and H_k[U] its ‘big’ fact
table such that Γ_k is the u-factorization of S_k over H_k. Now, let Z → T be an fd
in Υ(Γ_k). Then we define U-relation Y_k^j[V_j D_j | S T] by query formula (5.6), and
say that Y_k^j is a predictive projection of H_k, where:

$$Y_k^j := \pi_{S, T}\bigl(\sigma_{\upsilon = k}(Y_0) \bowtie \bigl(\bowtie_{i \in I} Y_k^i\bigr) \bowtie H_k\bigr) \qquad (5.6)$$

(a) Y_k^i[V_i D_i | φ A_i G_i] is a u-factor projection of H_k;

(b) we have i ∈ I if A_i ∈ Z;

(c) we take S = Z \ W, where W is the set of all pivot attributes representing
first causes in Z.


Algorithm 8 p-DB synthesis applied over folding fd set.

procedure SYNTHESIZE(S_k : structure, H_k : ‘big’ table, Y_0 : explan. table)
Require: S_k is complete
Ensure: U-relations Y_k returned are a BCNF, lossless decomposition of H_k
    Σ_k ← h-encode(S_k)
    Ω_k ← u-factor-learning(Φ(Σ_k), H_k) ∪ Υ(Σ_k)
    Γ_k ← merge(folding(Ω_k))

    Part I: U-factorization
    for all ⟨φ A, G⟩ ∈ Φ(Γ_k) do            ▷ scans over the u-factors of hypothesis k
        Y_k^i ← π_{φ,A,G}(repair-key_{φ@count}(γ_{φ,A,G,count(*)}(H_k)))
        Y_k ← Y_k ∪ Y_k^i

    Part II: U-propagation
    for all ⟨Z, T⟩ ∈ Υ(Γ_k) do              ▷ scans over the claims of hypothesis k
        W ← ∅                               ▷ prepares to keep track of u-factor pivot attributes
        for all Y_k^i[V_i D_i | φ A_i G_i] ∈ Y_k do
            if A_i ∈ Z then                 ▷ A_i is a first cause of all C ∈ T
                I ← I ∪ {i}                 ▷ indexes the u-factor projection
                W ← W ∪ A_i                 ▷ keeps track of u-factor’s pivot attribute
        S ← Z \ W                           ▷ removes u-factor pivot attributes
        Y_k^j ← π_{S,T}(σ_{υ=k}(Y_0) ⋈ (⋈_{i∈I} Y_k^i) ⋈ H_k)
        Y_k ← Y_k ∪ Y_k^j
    return Y_k
Fig. 5.7 shows the rendered U-relations for hypothesis k = 3 whose ‘big’
fact table is shown in Fig. 5.3. Note that tid = 6 in H_3 corresponds now to
θ = { x_1 ↦ 3, x_2 ↦ 1, x_3 ↦ 3, x_4 ↦ 2 }, where θ defines a particular world in
W whose probability is Pr(θ) ≈ .055. This value is derived from the marginal
probabilities stored in world table W (see Fig. 5.7) as a result of the application
of formulas Eq. 5.1 and Eq. 5.5.
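The derivation of a world’s probability from the world table can be sketched as a product of marginals, one per independent random variable. The function name `world_probability` and the dictionary layout of the world table are hypothetical, and any marginals passed in are the caller’s own: the actual values behind Pr(θ) ≈ .055 are those stored in W in Fig. 5.7 and are not reproduced here.

```python
from fractions import Fraction
from math import prod


def world_probability(world_table, theta):
    """Probability of world theta, assuming the random variables are independent.

    world_table maps (variable, alternative) to its marginal probability;
    theta maps each variable to the alternative chosen in that world.
    """
    return prod(world_table[(var, alt)] for var, alt in theta.items())


# Made-up marginals, for illustration only.
example_table = {
    ("x1", 3): Fraction(1, 2),
    ("x2", 1): Fraction(1, 2),
    ("x3", 3): Fraction(1, 4),
    ("x4", 2): Fraction(1, 1),
}
example_theta = {"x1": 3, "x2": 1, "x3": 3, "x4": 2}
```

Exact fractions are used so that the product is not blurred by floating-point rounding.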


Remark 7 Observe that, although (Alg. 8) synthesize operates locally for each
hypothesis k, the effects of p-DB synthesis (U-intro) in the pipeline are global on
account of the (global) ‘explanation’ relation H_0 (then U-relation Y_0); e.g., see
Fig. 5.7. In fact, the probability of each tuple (row), say, in U-relation Y_k^j with
φ = p for hypothesis υ = k, is distributed among all the hypotheses ℓ ≠ k that are keyed
in Y_0 under φ = p, i.e., all hypotheses that compete at φ = p. □

Figure 5.7. U-relations rendered for hypothesis υ = 3.
U-relations rendered by p-DB synthesis are ready for querying. Typical
queries comprise the conf() aggregate operation, inquiring the probability (or
confidence) for each tuple to be ‘true’ in the probability space captured by the
hypothesis competition. We illustrate queries in Chapter 6.


5.5. Design-Theoretic Properties 

For the U-intro procedure to be meaningful, we have to study design-theoretic
properties of u-factor projections and predictive projections synthesized out of ‘big’
fact table H_k for the sake of predictive analytics. In particular, for the projections
to be claim-centered, we submit that they should satisfy Boyce-Codd normal form
(BCNF, cf. Def. 21) w.r.t. the repaired factorization Γ'_k of S_k over H_k; and for them
to be a correct decomposition of the uncertainty present in H_k, their join should be
lossless (preserve the data in H_k, cf. Def. 22) w.r.t. Γ'_k.

Note that in this study we consider repaired factorization Γ'_k (not u-factorization
Γ_k), since it is the one which actually holds in Y_k after key repairing.

5.5.1 Claim-Centered Decomposition 

As emphasized through Remark 6, every fd in u-factorization Γ_k is a claim, and
the same holds for repaired factorization Γ'_k. Thus, for a claim-centered
decomposition of ‘big’ fact table H_k, it is desirable that U-relational
schema Y_k satisfies BCNF w.r.t. Γ'_k. BCNF (‘do not represent the same
fact twice’ [21, p. 251]) is our notion of ‘good design’ for uncertainty decomposition
in view of predictive analytics. This is to prevent the uncertainty of one claim from
being undesirably mixed with the uncertainty of another claim.

Def. 20 Let R[U] be a relation scheme over set U of attributes, and Σ a set of
fd’s. Then the projection of Σ onto R[U], written π_U(Σ), is the subset Σ' ⊆ Σ of
fd’s X → Z such that XZ ⊆ U.


Def. 21 Let R[U] be a relation scheme over set U of attributes, and Σ a set of fd’s
on U. We say that:

(a) R is in BCNF if, for all ⟨X, A⟩ ∈ Σ^+ with A ∉ X and XA ⊆ U, we have
X → U (i.e., X is a superkey for R);

(b) A schema R is in BCNF if all of its schemes R_1, ..., R_n ∈ R are in BCNF.


Example 7 To illustrate the concept of BCNF, let us consider canonical fd set
Σ = {A → B, B → C} over attributes U = {A, B, C}, and a tentative schema
containing a single relation R[ABC]. This relation is not in BCNF because, for
one, B → C violates it (C ∉ B but B is not a superkey for R). □

Observe also that an overdecomposed schema may (trivially) satisfy BCNF.
For example, let Σ = {A → B, A → C}; then by Def. 21 both schemas R =
{R_1[ABC]} and R' = {R_1[AB], R_2[AC]} are in BCNF w.r.t. Σ. The second,
however, breaks data into two tables, making their access more difficult than necessary
since both B and C bring in information about A. That is, if the schema were to
be synthesized over the fd’s in Σ, then it would be desirable to apply (R4) union, or
merge them, beforehand. Our point is that, if we target a BCNF-satisfying schema,
then it is also desirable for it to be the minimal-cardinality schema in BCNF.
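Both observations above can be checked mechanically. The sketch below is a brute-force reading of Def. 21(a) for a single scheme: it enumerates proper subsets X of U and flags a violation when X determines some attribute outside X without determining all of U. The names `closure` and `is_bcnf`, and the representation of fd’s as pairs of frozensets, are choices made for this illustration; the test is exponential in |U| and only meant for small examples.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(scheme, fds):
    """Check Def. 21(a) for one scheme over attribute set `scheme`."""
    u = frozenset(scheme)
    for size in range(1, len(u)):
        for combo in combinations(sorted(u), size):
            x = frozenset(combo)
            determined = closure(x, fds) & u
            # Nontrivial fd inside U whose left-hand side is not a superkey.
            if determined - x and determined != u:
                return False
    return True


chain = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
star = [(frozenset("A"), frozenset("B")), (frozenset("A"), frozenset("C"))]
```

With `chain` the single scheme ABC fails because of B → C, as in Example 7; with `star` both ABC and the overdecomposed pair AB, AC pass, which is why the minimal-cardinality requirement is stated separately.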

Theorem 6 guarantees the BCNF property w.r.t. Γ'_k by design for every
schema Y_k rendered by (Alg. 8) synthesize over Γ_k.

Theorem 6 Let S_k and H_k be (resp.) the complete structure and ‘big’ fact table
of hypothesis k, and let Γ'_k be the repaired factorization of S_k over H_k, and Y_0 the
‘explanation’ table where hypothesis k is recorded. Now, let Y_k be a U-relational
schema defined Y_k = synthesize(S_k, H_k, Y_0). Then Y_k is in BCNF w.r.t. Γ'_k and
is minimal-cardinality.


Proof 15 We exploit the fact that the projection of (Γ'_k)^+ onto u-factor projections
and predictive projections defines a disjoint partition of (Γ'_k)^+ into its φ-projection
Φ(Γ'_k)^+ and υ-projection Υ(Γ'_k)^+. Since we know the form of fd’s in each of them,
the search space for BCNF violations is significantly reduced. The minimality of
|Y_k| in turn comes from (Alg. 7) merge. See Appendix, §A.3.1. □


5.5.2 Correctness of Uncertainty Decomposition 


Recall from the preliminaries (cf. §5.1) that the U-relational equivalent of the
relational product operation (main sub-operation of the join operation) has been
introduced. Now, we provide the classical definition of a lossless join [20], i.e.,
when a decomposition of data from a relation into two or more relations is known
to preserve the data in its original form by an application of the join.

Def. 22 Let R[U] be a (U-)relational schema synthesized into collection R =
⋃_{i=1}^{n} R_i[U_i], and Σ an fd set on attributes U. We say that R has a lossless join
w.r.t. Σ if for every instance r of R[U] satisfying Σ, we have r = π_{R_1}(r) ⋈ ... ⋈ π_{R_n}(r).
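The condition of Def. 22 can be tried out on a concrete instance. The sketch below projects a small relation onto each scheme, joins the projections back and compares the result with the original rows. Note the limitation: Def. 22 quantifies over every instance satisfying Σ, whereas this check (with the hypothetical names `project`, `natural_join` and `is_lossless_on`) only tests the one instance it is given.

```python
def project(rows, attrs):
    """pi_attrs(rows) with set semantics; tuples are stored as sorted (attr, value) pairs."""
    return {tuple(sorted((a, row[a]) for a in attrs)) for row in rows}


def natural_join(left, right):
    """Natural join of two relations in the representation used by project()."""
    out = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                out.add(tuple(sorted({**ld, **rd}.items())))
    return out


def is_lossless_on(rows, schemes):
    """True if joining the projections of rows onto schemes gives back rows."""
    all_attrs = sorted({a for s in schemes for a in s})
    joined = project(rows, schemes[0])
    for scheme in schemes[1:]:
        joined = natural_join(joined, project(rows, scheme))
    return joined == project(rows, all_attrs)
```

For instance, a relation over A, B, C in which A is a key decomposes losslessly into AB and AC, while splitting it into AB and BC can produce spurious tuples when B repeats.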

The lossless join property is of interest to ensure that our decomposition of
the data from the ‘big’ fact table into u-factor projections preserves the data, so
that their join, when propagated by means of the U-relational join operation to
‘annotate’ the predictive projections, is correct. Theorem 7 guarantees that this is the case.

Theorem 7 Let S_k be the complete structure of hypothesis k, and H_k[U] its ‘big’
fact table such that Γ'_k is the repaired factorization of S_k over H_k, and Y_0 is the
‘explanation’ table where hypothesis k is recorded. Now, let Y_k be a U-relational
schema defined Y_k = synthesize(S_k, H_k, Y_0). Then,

(a) the join ⋈_{i=1}^{m} Y_k^i[V_i D_i | φ A_i G_i] of any subset of the u-factor projections
of H_k is lossless w.r.t. Γ'_k.

(b) any predictive projection Y_k^j[V_j D_j | S T], result of a join of the theoretical
u-factor Y_0[V_0 D_0 | φ υ] with the ‘big’ fact table H_k[U] and in turn with
u-factor projections Y_k^i[V_i D_i | φ A_i G_i], is lossless w.r.t. Γ'_k.


Proof 16 We make use of a lemma from Ullman (p. 397), and then the proof
comes straightforwardly. See Appendix, §A.3.2. □


Remark 8 The significance of Theorem 6 lies in that it guarantees the decomposition
of uncertainty based on the causal ordering processing is in fact claim-centered
as desirable for predictive analytics. Theorem 7 in turn is significant as it ensures
that all the empirical uncertainty implicit in a hypothesis ‘big’ fact table H_k can
be decomposed into u-factor projections that are (a) independent (not strongly
correlated, cf. Problem 1), and (b) can be fully recovered by a join that is lossless
w.r.t. repaired factorization Γ'_k of structure S_k over H_k. This is essential to make
sure that in u-propagation the composition of the required u-factors recovers the
uncertainty associated with the predictive data. Since repaired factorization Γ'_k is
known to be a correct processing of the causal ordering (cf. the earlier results on
causal reasoning), altogether Theorem 7 guarantees that the first causes are joined
together correctly towards the predictive variables influenced by them. □

As we have seen, the p-DB synthesis technique presented here is essentially
targeted at design-theoretic properties. It is also motivated by computational
performance, as uncertainty decomposition is desirable also to speed up
probabilistic inference [17, p. 30-1]. In fact, the U-intro procedure is fully grounded
in U-relations and p-WSA as implemented in the MayBMS system. Its
computational performance is dominated by U-relational query processing. We present
experimental studies on the U-intro procedure in §6.4, as they are designed from an
applicability point of view. The goal is to provide some reference computational
measures for prospective users.

5.6. Related Work 

Informed by research on Graphical Models (GM) [69], Suciu et al. provide a
striking motivation for work on probabilistic database design [17, p. 30-1]. In GM
design, probability distributions on large sets of random variables are decomposed
into factors of simpler probability functions, over small sets of these variables. The
factors can be identified, e.g., by using a set of axioms (the so-called ‘graphoids’) for
reasoning about the probabilistic independence of variables. The same design
principle (sic.) applies to p-DB’s [17]: the data should be decomposed into its
simplest components so that only key constraints hold in a table (i.e., it is in BCNF).
Attribute- and tuple-level correlations should guide the table decomposition into
simpler tables. Ideally, the original table with its probability distribution can be
recovered as a query (a view) from the decomposed tables. We have followed such
principle in our claim-centered decomposition for predictive analytics.

In fact, a connection between database normalization theory and factor
decomposition in Graphical Models (GM) has been discussed by Verma and Pearl,
but has not been explored since then. To date, there is no formal design theory
for p-DB’s [17]. A step in that direction is taken by Sarma et al. Their initiative
revisits dependency theory in view of reformulating fd’s for uncertain schema
design. Our work takes a different direction. We refer to classical dependency
theory and U-relational operations (viz., its uncertainty-introduction operator) to
construct p-DB’s systematically from scratch. We have focused on the extraction
and processing of fd’s towards a factorized U-relational schema. The synthesized
schema is ensured to be in BCNF and have a lossless join.

Despite some major differences, our synthesis method builds upon the classical
theory of relational schema design by synthesis [62]. Classical design by synthesis
[62] was once criticized due to its too strong ‘uniqueness’ of fd’s assumption
[p. 443], as it reduces the problem of design to symbolic reasoning on fd’s, arguably
neglecting semantic issues. Probabilistic design, however, has roots in statistical
design so that the problem is less amenable to human factors. As we extract the
dependencies from a formal specification, design by synthesis is doing nothing but
translating seamlessly (into fd’s) the reduction made by the user herself in her
tentative model for the studied phenomenon.

The last decade has seen significant research effort to make DB systems really
usable [73]. Our design-by-synthesis framework can also be understood as a
technique for user-friendly p-DB design. For instance, in comparison, the CRIUS
system supports another kind of user-friendly DB design approach that provides users
with a spreadsheet-like direct manipulation interface to increasingly add structure
to their data. Our dependency extraction and processing, instead, completely
relieves the user of the burden of data organization.

Also related to probabilistic DB design is the topic of conditioning a p-DB.
It was first addressed by Koch and Olteanu, motivated by data cleaning
applications. They have introduced the assert operation to implement, as in
AI, a kind of knowledge compilation, viz., world elimination in face of constraints
(e.g., FDs). For hypothesis management, nonetheless, we need to apply Bayes’
conditioning by asserting observed data, not constraints. In §2.5 we have presented
an example that settles the kind of conditioning problem that is relevant to the
Υ-DB vision. In Chapter 6 we present realistic use cases. We have addressed the
problem at application level only in order to complete the realization of the vision
in a real prototype system. The formulation of Bayes’ conditioning as an extension
of, say, the U-relational data model is open to future work (cf. §7.3).
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5.7. Summary of Results 


In this chapter we have studied and developed our end-purpose technique for 
the synthesis of a probabilistic DB geared for predictive analytics. It completes 


the pipeline (Fig. 1.4) so that conditioning can then be performed iteratively. 


• Algorithm 8 (synthesize) gives a general formulation of how to perform
uncertainty introduction from causal dependencies given in the form of fd’s.

• By Problem 1 we have given a definition of uncertainty factor learning
from data available in a given relation.

• Remark 7 provides an example of the u-factor and predictive projections
resulting from p-DB synthesis and their corresponding probability
distributions stored in the world table.

• By Remark 6 and Theorem 6 we have shown that U-relational schema Y_k
synthesized over the fd’s processed by causal reasoning is in BCNF. That
is, it is in fact a claim-centered decomposition as desirable for predictive
analytics.

• Theorem 7 ensures that such (uncertainty) decomposition is correct, as the
original (probability distribution) ‘big’ fact table is fully recoverable by a
lossless join.




Chapter 6 


Applicability 


In this chapter we show the applicability of Υ-DB in real-world scenarios.
We present use cases in Computational Physiology extracted from the Physiome
project. In §6.1 we introduce the Physiome project as providing a testbed for
Υ-DB. Then in §6.2 we go through some Physiome case studies to show the
construction of Υ-DB and its application for data-driven hypothesis management and
analytics. In §6.3 we present a prototype of the Υ-DB system and demonstrate it
through the running example introduced in §5.2. In §6.4 we present experiments on
Physiome hypotheses. In §6.5 we provide a general discussion on the applicability
of Υ-DB, its assumptions and scope. In §6.6 we conclude the chapter.


6.1. The Physiome Project as a Testbed 

The Physiome project is an initiative to seriously address the problems of
reproducibility, model integration and sharing in Computational Physiology [33]. It
essentially comprises:

• a curated repository of 380+ computational physiology models available
online for researchers;

• the Mathematical Modeling Language (MML) to allow models to be written
in declarative form and then exported into a number of XML-compliant
interoperable formats;

• a problem-solving environment called JSim to allow researchers to code
their MML models straightforwardly, run them under different parameter
and solver settings and build customized data plots to see the results.

^ http://physiome.org

^ The Physiome model repository is expanded to over liK+ models by including models
extracted from other sources (such as the EMBL-EBI BioModels DB, the CellML Archive, and
the KEGG Pathways DB) and converted to MML automatically.

From the point of view of Υ-DB, Physiome is an external data source that
provides a very interesting testbed with realistic scenarios. We extract Physiome
models into Υ-DB by means of a wrapper we have implemented to read XMML files
(JSim’s XML encoding of MML models). Simulation trial datasets are rendered by
a parametrized UNIX script we have developed to invoke JSim automatically (in
batch mode). Currently, Υ-DB’s Physiome wrapper is designed to read MAT files
to load both the model input (parameter settings data) and its associated model
output (computed predictive data) for each simulation trial.

Physiome does not keep records of phenomena in a repository, but it does
have observational data attached to some of the entries of the model repository.
Such models appear in the filter ‘models with data,’ meaning that they have one
or more observational datasets and plots showing how the model data fits the
observations. We shall make use of model entries containing observational data in
the realistic scenarios presented in this paper.

6.2. Case Studies 

In this section we present use cases extracted from the Physiome model repository.


6.2.1 Case: Hemoglobin Oxygen Saturation 

In this case we stress the potential of data-driven hypothesis analytics in
comparison to handcrafted curve fitting (visual) analysis. We study three different
hypotheses that perform “closely” visually when compared to their target
phenomenon dataset, see Fig. 6.1. All of them have been empirically tuned to fit the
observations (‘Rlsl’ dataset) as closely as possible in their local view (separately),
and are now compared together in a global view.


http://www.physiome.org/jsim/docs/MML_Intro.html
http://www.physiome.org/Models/modelDB/
Figure 6.1. Plot of hemoglobin oxygen saturation hypotheses (SHbO2_{Ad, H,
D} curves) and their target observations (‘Rlsl’ dataset) (source: Physiome).


Example 8 The resources of this example are shown in Fig. 6.2. We consider
the Physiome model entries described in relation HYPOTHESIS, associated to the
phenomenon described in relation PHENOMENON (cf. explanation relation H_0).
One single hypothesis trial (its best fit) is considered for each hypothesis. □


HYPOTHESIS
υ     Name        Description
28    HbO.Hill    Hill Equation for O2 binding to hemoglobin.
31    HbO.Adair   Hemoglobin O2 saturation curve using Adair’s 4-site equation.
32    HbO.Dash    Hemoglobin O2 saturation curve at varied levels of PCO2 and pH.

PHENOMENON
φ     Description
1     Hemoglobin oxygen saturation with observational dataset from Severinghaus 1979.

H_0
φ     υ
1     28
1     31
1     32

Figure 6.2. Descriptive (textual) data of Example 8, with ids υ from Physiome’s
model repository (http://www.physiome.org/Models/modelDB/).
Encoding. The fd encoding of hypotheses υ ∈ {28, 31, 32} is shown (resp.) in
Fig. 6.3, Fig. 6.4 and Fig. 6.5.

Σ_28 = { KO2 n pO2 υ → SHbO2_H,
         n p50 υ → KO2,
         φ → n p50 pO2_delta pO2_max pO2_min }.

Figure 6.3. Fd set Σ_28 of hypothesis υ = 28.

Σ_31 = { A1 A2 A3 A4 pO2 υ → SHbO2_Ad,
         p50 υ → A1 A2 A4,
         φ → A3 p50 pO2_delta pO2_max pO2_min }.

Figure 6.4. Fd set Σ_31 of hypothesis υ = 31.

Σ_32 = { KO2 pO2 υ → SHbO2_D,
         term1 term2 term3 υ → KO2,
         HpRBC K6p alphaO2 υ → term1,
         HpRBC K3 K5 alphaCO2 pCO2 υ → term2,
         HpRBC K2 K4 alphaCO2 pCO2 υ → term3,
         pH υ → HpRBC,
         φ → K2 K3 K4 K5 K6p alphaCO2 alphaO2 pCO2 pH pO2_delta pO2_max pO2_min }.

Figure 6.5. Fd set Σ_32 of hypothesis υ = 32.


Symbol Mappings. As we have seen, the insertion of hypothesis trial datasets
requires users to specify a target phenomenon and the corresponding mappings
from the hypothesis symbols to the target phenomenon symbols. In this use case,
we have:

• M_{28↦1} = { pO2 ↦ pO2, SHbO2_H ↦ SHbO2 };

• M_{31↦1} = { pO2 ↦ pO2, SHbO2_Ad ↦ SHbO2 };

• M_{32↦1} = { pO2 ↦ pO2, SHbO2_D ↦ SHbO2 }.
Hypothesis Management. Query Q1 illustrates the feature of hypothesis
management for this case. We consider the user is interested in all SHbO2 predictions
over a subset of the pO2 domain. Its result set is shown in Fig. 6.6.

Q1. (select phi, upsilon, tid, "pO2", "SHbO2_H" as SHbO2 from Y28_claim1
     where phi=1 and "pO2">=20 and "pO2"<=40) union all
    (select phi, upsilon, tid, "pO2", "SHbO2_Ad" as SHbO2 from Y31_claim1
     where phi=1 and "pO2">=20 and "pO2"<=40) union all
    (select phi, upsilon, tid, "pO2", "SHbO2_D" as SHbO2 from Y32_claim1
     where phi=1 and "pO2">=20 and "pO2"<=40) order by "pO2", upsilon, tid;

Q1
φ    υ     tid   pO2   SHbO2
1    28    1     20    0.329956122828398
1    31    1     20    0.294443723056007
1    32    1     20    0.334165672301096
…    …     …     …     …
1    28    1     40    0.761898061367189
1    31    1     40    0.823463829100424
1    32    1     40    0.759042287799556

Figure 6.6. Result set of hypothesis management query Q1.


Hypothesis Analytics. Fig. 6.7 shows the results of analytics after
conditioning the probability distribution in the presence of observations (‘Rlsl’ dataset).
The finding that hypothesis υ = 31 provides the best explanation for the studied
phenomenon is enabled by the application of Bayesian inference as implemented
within the Υ-DB system. The contribution of the Υ-DB methodology is to equip
users with a tool for large-scale, data-driven hypothesis management and analytics.
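As a rough intuition for how conditioning reorders competing hypotheses, the sketch below applies Bayes’ rule with a uniform prior and a Gaussian likelihood on the residuals between predictions and observations. The likelihood model, the `sigma` parameter, the function name `posterior` and all numbers are assumptions made for this illustration; the text does not specify the likelihood used inside the system, and the sketch is not meant to reproduce the values of Fig. 6.7.

```python
import math


def posterior(priors, predictions, observations, sigma=1.0):
    """Posterior over hypotheses: prior times Gaussian likelihood, then normalized.

    priors: {hypothesis: prior probability}
    predictions: {hypothesis: {x: predicted y}}
    observations: {x: observed y}
    """
    log_w = {}
    for h, prior in priors.items():
        sse = sum((predictions[h][x] - y) ** 2 for x, y in observations.items())
        log_w[h] = math.log(prior) - sse / (2 * sigma ** 2)
    top = max(log_w.values())  # subtract the maximum for numerical stability
    weights = {h: math.exp(v - top) for h, v in log_w.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Toy numbers, not taken from the Physiome datasets.
toy_priors = {"h1": 0.5, "h2": 0.5}
toy_predictions = {"h1": {20: 0.30, 40: 0.80}, "h2": {20: 0.35, 40: 0.75}}
toy_observations = {20: 0.30, 40: 0.80}
toy_posterior = posterior(toy_priors, toy_predictions, toy_observations, sigma=0.05)
```

With equal priors, the hypothesis whose predictions lie closer to the observations receives the larger posterior, which is the qualitative behavior seen in the Prior and Posterior columns of the analytics result.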


PH1_CONF
φ    υ     tid   pO2   SHbO2                  Prior   Posterior
1    28    1     1     0.000151184162020125   .333    .326
1    31    1     1     0.003789100566457180   .333    .349
1    32    1     1     0.000178973375779681   .333    .325
…    …     …     …     …                      …       …
1    28    1     100   0.974346796798538      .333    .326
1    31    1     100   0.990781330988763      .333    .349
1    32    1     100   0.972764121981342      .333    .325

Figure 6.7. Results of analytical study on the hemoglobin phenomenon.
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6.2.2 Case: Baroreflex Dysfunction in Dahl SS Rat 

This case is extracted from the Virtual Physiological Rat project. Here we show
the potential of data-driven hypothesis management and analytics for model
tuning. Fig. 6.8 shows the best fit of a baroreflex model for an observational dataset
acquired by experiment on Dahl SS rat [75]. We in turn use Υ-DB to carry out
such hypothesis management and analytics. We generate 1K trials by a parameter
sweep script and insert them into the database. A best fit is then selected
automatically by Bayesian inference.



Figure 6.8. Plot of baroreflex hypothesis (‘HR’) for Dahl SS Rat and its target
observations (‘data’) (source: [75]).


Example 9 The resources of this example are shown in Fig. 6.9. We consider the
single hypothesis entry described in relation HYPOTHESIS, and the phenomenon
described in relation PHENOMENON. By parameter sweep, 1K trials are inserted
into Υ-DB for management and analytics. □


HYPOTHESIS
υ      Name               Description
1001   Baroreflex_SB_CT   Physiological model of the full baroreflex heart control
                          system based on experimental measurements.

PHENOMENON
φ      Description
2      Baroreflex dysfunction in Dahl SS Rat.

H_0
φ      υ
2      1001

Figure 6.9. Descriptive (textual) data of Example 9.

Encoding. The fd encoding of hypothesis υ ∈ {1001} is shown in Fig. 6.12.


http://virtualrat.org/computational-models/
Symbol Mappings. We consider that the user provides symbol mappings:

• M_{1001↦2} = { Time ↦ Time, HR ↦ HR };


Hypothesis Management. In query Q2 we consider that the user is interested in time instants where the heart rate is higher than a threshold, say, 300 beats/min. The result set is shown in Fig. 6.10.

Q2. select phi, upsilon, tid, "Time", "HR" from Y1001_claim1
    where phi=2 and "HR">=300 order by "Time", tid;
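
For readers who want to try the shape of Q2 outside MayBMS, the following is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the query above; four of the rows are copied from Fig. 6.10 and one row below the threshold is made up. An ordinary SQLite table is only a stand-in for a U-relational table here, so nothing in the sketch reflects how T-DB stores or evaluates uncertain data.

```python
import sqlite3

# Stand-in for the predictive table of hypothesis 1001 (plain SQLite, not MayBMS).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Y1001_claim1 (phi INTEGER, upsilon INTEGER, tid INTEGER, '
    '"Time" REAL, "HR" REAL)'
)
rows = [
    (2, 1001, 1, 0.60, 299.5),  # hypothetical row, below the threshold
    (2, 1001, 1, 0.61, 300.013659905941),
    (2, 1001, 2, 0.61, 300.011268345391),
    (2, 1001, 96, 0.61, 300.001934440349),
    (2, 1001, 1, 0.62, 300.607671377207),
]
conn.executemany("INSERT INTO Y1001_claim1 VALUES (?, ?, ?, ?, ?)", rows)

# Same selection and ordering as Q2.
q2 = (
    'SELECT phi, upsilon, tid, "Time", "HR" FROM Y1001_claim1 '
    'WHERE phi = 2 AND "HR" >= 300 ORDER BY "Time", tid'
)
result = conn.execute(q2).fetchall()
for row in result:
    print(row)
```

The made-up row is filtered out by the threshold, and the remaining rows come back ordered by time instant and then by trial id.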


Q2
φ  υ     tid  Time  HR
2  1001  1    0.61  300.013659905941
2  1001  2    0.61  300.011268345391
2  1001  96   0.61  300.001934440349
2  1001  1    0.62  300.607671377207



Figure 6.10. Result set of hypothesis management query Q2. 


Hypothesis Analytics. Fig. 6.11 shows the results of analytics on phenomenon φ = 2 after conditioning the probability distribution in the presence of observations (‘SSBN9_HR’ dataset). Since this case deals with model tuning, viz., 1K slightly different parameter settings, the trial ranking is decided by small differences in the posterior probability distribution (cf. Fig. 6.11).


PH2 CONF
φ  υ     tid  Time   HR                Prior    Posterior
2  1001  491  0.4    286.556506432110  .001000  .00103159
2  1001  591  0.4    286.525209765226  .001000  .00103144
2  1001  492  0.4    286.555565558231  .001000  .00103023
2  1001  491  118.9  421.251853783050  .001000  .00103159
2  1001  591  118.9  421.110425308905  .001000  .00103144
2  1001  492  118.9  421.297317710657  .001000  .00103023



Figure 6.11. Results of analytical study on the baroreflex phenomenon. 


6.2.3 Case: Myogenic Behavior of a Blood Vessel 

Computational models of physiology may account for diverse effects that take 
place at different levels of biological organization from the organ to the cellular 
Σ_1001 = { HR υ → Period,
  Beta HR_p HR_s HRmin HRo υ → HR,
  HRo delta_HR_p υ → HR_p,
  delta_HR_pfast delta_HR_pslow υ → delta_HR_p,
  HRo delta_HR_s υ → HR_s,
  Time delta_HR_s_Time_min delta_HR_ss tau_HR_nor υ → delta_HR_s,
  Gamma delta_HR_ps υ → delta_HR_pfast,
  Gamma Time delta_HR_ps delta_HR_pslow_Time_min tau_HR_ach υ → delta_HR_pslow,
  K_nor c_nor delta_HR_smax υ → delta_HR_ss,
  C_ach K_ach delta_HR_pmax υ → delta_HR_ps,
  Time Ts c_nor_Time_min q_nor tau_nor υ → c_nor,
  C_ach_Time_min Time Tp q_ach tau_ach υ → C_ach,
  Gs Tsmax Tsmin alpha_cns alpha_s0 υ → Ts,
  Gp Tpmax Tpmin alpha_cns alpha_p0 υ → Tp,
  Gcns n υ → alpha_cns,
  S Zeta delta delta_th υ → n,
  Eps_1 Eps_wall υ → delta,
  B1 Eps_1_Time_min Eps_2 Eps_2_Time_min Eps_wall K1 Kne Time υ → Eps_1,
  B1 B2 Eps_1 Eps_1_Time_min Eps_2_Time_min Eps_3 Eps_3_Time_min K1 K2 Time υ → Eps_2,
  B2 B3 Eps_2 Eps_2_Time_min Eps_3_Time_min K2 K3 Time υ → Eps_3,
  R R0 υ → Eps_wall,
  A υ → R,
  A_Time_min Bwall Cwall P R0 Time υ → A,
  HRmax HRo υ → delta_HR_smax,
  HRmin HRo υ → delta_HR_pmax,
  Time φ → data,
  φ → A_Time_min B1 B2 B3 Beta Bwall C_ach_Time_min Cwall Eps_1_Time_min
      Eps_2_Time_min Eps_3_Time_min Gamma Gcns Gp Gs HRmax HRmin HRo K1 K2 K3
      K_ach K_nor Kne P R0 S Time_delta Time_max Time_min Tpmax Tpmin Tsmax Tsmin
      Zeta alpha_p0 alpha_s0 c_nor_Time_min delta_HR_pslow_Time_min delta_HR_s_Time_min
      delta_th q_ach q_nor tau_HR_ach tau_HR_nor tau_ach tau_nor }.

Figure 6.12. Fd set Σ_1001 of hypothesis υ = 1001.


and molecular levels [33]. Typically, a sophisticated model is developed incrementally by, say, adding detail to some previously existing model or extending its dimensionality (e.g., extending it from a stationary to a dynamic account of phenomena). In this case study (cf. Example 10) we consider alternative models of the myogenic behavior of a reference human blood vessel.





[Plot; vertical axis: Vessel Diameter.]
Figure 6.13. Plot of myogenic behavior hypothesis (‘D’) according to υ = 89,
trial tid=2, and its target observations (‘Diameter’).


Example 10 (See Fig. 6.14). We consider the Physiome model entries displayed
in relation HYPOTHESIS, and one phenomenon (see relation PHENOMENON). One trial
is considered for hypothesis υ = 60, and two for hypothesis υ = 89. □


HYPOTHESIS
υ   Name                       Description
60  Myogenic_Compliant_Vessel  This model simulates the flow through a passive and actively responding vessel driven by a sinusoidal pressure input.
89  Myo_Dyn_Resp_wFit          This model describes the dynamic response of a vessel after a step increase in intraluminal pressure.

PHENOMENON
φ  Description
3  Dynamics of vessel diameter in response to pulsatile intraluminal pressure.

φ  υ
3  60
3  89

Figure 6.14. Descriptive (textual) data of Example 10.

Encoding. The fd encoding of hypotheses υ ∈ {60, 89} is shown (resp.) in Fig. 6.15 and Fig. 6.16.


Symbol Mappings. We consider that the user provides symbol mappings:

• M_{60↦3} = { t ↦ Time, D ↦ Diameter };

• M_{89↦3} = { t ↦ Time, D ↦ Diameter };
Σ_60 = { Fcomp Fout υ → Fin,
  Pin Pout R υ → Fout,
  V V_t_min t υ → Fcomp,
  D L mu υ → R,
  D L υ → V,
  D Pin T υ → A,
  A C1a C1p C2a C2p C3a Dp100 Pin T Ttarget υ → Atarget,
  Atarget Cglobal Cmyo Pin υ → D,
  D D_t_min Dc Tc Ttarget t taud υ → T,
  A A_t_min Atarget t taua υ → Ttarget,
  Ac υ → A_t_min,
  Cglobal Cmyo Tc υ → Ac,
  Dc υ → D_t_min,
  Dc Pmean υ → Tc,
  Pamp Pmean t tnorm υ → Pin,
  C1a C1p C2a C2p C3a Cglobal Cmyo Dp100 Pc υ → Dc,
  φ → C1a C1p C2a C2p C3a Cglobal Cmyo Dp100 L Pamp Pc Pext
      Pmean Pout V_t_min mu t_delta t_max t_min taua taud tnorm }.

Figure 6.15. Fd set Σ_60 of hypothesis υ = 60.


Σ_89 = { D P T υ → A,
  A C1a C1p C2a C2p C3a Dp100 P T Ttarget υ → Atarget,
  Atarget Cglobal Cmyo P υ → D,
  D D_t_min Dc Tc Ttarget t taud υ → T,
  A A_t_min Atarget t taua υ → Ttarget,
  Ac υ → A_t_min,
  Dc Pc υ → Tc,
  Cglobal Cmyo Dc Pc υ → Ac,
  Dc υ → D_t_min,
  DelP Pc υ → P,
  C1a C1p C2a C2p C3a Cglobal Cmyo Dp100 Pc υ → Dc,
  t φ → DelP,
  φ → C1a C1p C2a C2p C3a Cglobal Cmyo Dp100 Pc t_delta t_max t_min taua taud }.

Figure 6.16. Fd set Σ_89 of hypothesis υ = 89.

Hypothesis Management. Query Q3 illustrates the feature of hypothesis management for this case. The user selects all diameter predictions within the time interval t ∈ [100, 300] (cf. plot in Fig. 6.13). Its result set is shown in Fig. 6.17.

Q3. select phi, upsilon, tid, "t", "D" from Y60_claim1
    where phi=3 and "t">=100 and "t"<=300 union all
    select phi, upsilon, tid, "t", "D" from Y89_claim1
    where phi=3 and "t">=100 and "t"<=300 order by "t", upsilon, tid;


Q3
φ  υ   tid  t       D
3  60  1    100.00  194.622865847211
3  89  1    100.00  97.3787340059609
3  89  2    100.00  126.167727083098
3  60  1    100.01  194.626017703936
3  89  1    100.01  98.0174705905828
3  89  2    100.01  126.161751822302



Figure 6.17. Result set of hypothesis management query Q3. 


Hypothesis Analytics. Fig. 6.18 shows the results of analytics on phenomenon φ = 3 after conditioning the probability distribution in the presence of observations, viz., the ‘Davis_Sikes_Fig3_Myo_DigData’ dataset.


PH3 CONF
φ  υ   tid  Time  Diameter          Prior  Posterior
3  60  1    14.8  194.996792066637  .5     .000
3  89  1    14.8  97.0568250956827  .25    .269
3  89  2    14.8  116.327813203282  .25    .731
...
3  60  1    30.5  195.684170988267  .5     .000
3  89  1    30.5  97.0568250767575  .25    .269
3  89  2    30.5  116.327813337087  .25    .731



Figure 6.18. Results of analytics on the vessel’s myogenic behavior phenomenon. 


In this case study two tentative models have been considered under a uniform prior
probability distribution, which has been updated to a posterior distribution. Note
that, even though hypothesis υ = 60 has its prior probability weight concentrated in a
single trial, Bayesian inference is able to indicate υ = 89 as the best explanation
for φ = 3 and, in particular, trial tid = 2 as its best fit.
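
The prior columns of Figs. 6.7, 6.11 and 6.18 are consistent with a two-level uniform prior: the probability mass is split evenly among the competing hypotheses of a phenomenon, and each hypothesis's share is then split evenly among its trials. This is a reading of the figures rather than a definition quoted from the text, and the function name below is hypothetical. Under that assumption, a few lines of Python reproduce the prior values of Fig. 6.18.

```python
from fractions import Fraction


def two_level_uniform_prior(trials_per_hypothesis):
    """Split mass evenly over hypotheses, then evenly over each one's trials.

    trials_per_hypothesis maps a hypothesis id to its list of trial ids.
    Returns a dict {(hypothesis, trial): prior probability}.
    """
    n_hyp = len(trials_per_hypothesis)
    prior = {}
    for hyp, trials in trials_per_hypothesis.items():
        for tid in trials:
            prior[(hyp, tid)] = Fraction(1, n_hyp) * Fraction(1, len(trials))
    return prior


# Phenomenon 3: one trial for hypothesis 60, two trials for hypothesis 89.
prior = two_level_uniform_prior({60: [1], 89: [1, 2]})
for key, p in sorted(prior.items()):
    print(key, float(p))
```

The single trial of hypothesis 60 receives 1/2, while each of the two trials of hypothesis 89 receives 1/4, matching the Prior column of the figure.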

6.3. System Prototype 

A first prototype of the T-DB system has been implemented as a Java web application, with the pipeline component on the server side on top of MayBMS (a backend extension of PostgreSQL). We have developed a demonstration of this prototype (cf. [2H]), in which we go through the whole design-by-synthesis pipeline (Fig. 1.4) exploring use case scenarios. In this section we provide a brief demonstration of the system in the population dynamics scenario previously introduced in this thesis.

The demonstration unfolds in three phases. In the first phase, we show the ETL process to give a sense of what the user has to do in terms of simple phenomena description, hypothesis naming and file upload to get her phenomena and hypotheses available in the system to be managed as data. In the second phase, we reproduce some typical queries of hypothesis management (like those shown in the previous section). In the third phase, we enter the hypothesis analytics module. The user chooses a phenomenon for a hypothesis evaluation study, and the system lists all the predictions with their probabilities under some selectivity criteria (e.g., population at year 1920). The predictions are ranked according to their probabilities, which are conditioned on the observational data available for the chosen phenomenon.


6.3.1 Demo Screenshots 

Fig. 6.21 shows screenshots of the system. Fig. 6.21(a) shows the research projects currently available for a user. Figs. 6.21(b, c) show the ETL interfaces for phenomenon and hypothesis data definition (by synthesis), and then the insertion of hypothesis trial datasets, i.e., explanations of a hypothesis towards a target phenomenon. Fig. 6.21(d) shows the interface for basic hypothesis management by listing the predictions of a given simulation trial. Figs. 6.21(e, f) show two tabs of the hypothesis analytics module, viz., selection of observations and then viewing the corresponding alternative predictions ranked by their conditioned probabilities.


6.3.2 Demo Case: Population Dynamics 

In this case we refer to a well-known problem in Computational Science, viz.,
population dynamics scenarios, to demonstrate the T-DB system prototype. Fig.
6.19 shows census data collected in the US from 1790 to 1990. Fig. 6.20
shows observational data collected from Hudson’s Bay from 1900 to 1920 on the
Lynx-Hare population UH. 


Cf. https://www.census.gov/population/censusdata/table-4.pdf 
[Plot: US Population Census; series: Population.]

Figure 6.19. Census US population from 1790 to 1990.


[Plot: Lynx-Hare in Hudson's Bay; series: Lynx, Hare; horizontal axis: Year (1900 to 1920).]

Figure 6.20. Lynx-Hare population observed in Hudson’s Bay from 1900 to 1920.
[Screenshots: (a) Research dashboard after login. (b) Phenomenon data definition. (c) Hypothesis data definition. (d) Hypothesis management. (e) Analytics: selected observations tab. (f) Analytics: ranked predictions tab.]

Figure 6.21. Screenshots of this first prototype of the T-DB system.
Example 11 (See Fig. 6.22). We consider the model entries displayed in relation HYPOTHESIS, and two phenomena (see relation PHENOMENON). For φ = 1, three trials are considered for hypothesis υ = 1 and six for hypothesis υ = 2. For φ = 2, in turn, two trials are considered for each of hypotheses υ = 1 and υ = 2, and six trials for hypothesis υ = 3. Note the data definition interfaces in Figs. 6.21(b, c). □


HYPOTHESIS
υ  Name                     Description
1  Malthusian growth model  Exponential growth model ‘growth in population is proportional to its size’ is considered the first principle of population dynamics.
2  Logistic equation        This model introduces growth saturation to the Malthusian model due to the limitation of resources.
3  Lotka-Volterra model     This model describes predator-prey interactions.

PHENOMENON
φ  Description
1  US population from 1790 to 1990.
2  Lynx population in Hudson’s Bay, Canada, from 1900 to 1920.

φ  υ
1  1
1  2
2  1
2  2
2  3

Figure 6.22. Descriptive (textual) data of Example 11.

Encoding. The fd encoding of hypotheses υ ∈ {1, 2, 3} is shown (resp.) in Fig. 6.23, Fig. 6.24 and Fig. 6.25. See hypothesis structure processing in Fig. 6.21(c).


Σ_1 = { b t x_t_min υ → x,
  φ → b t_delta t_max t_min x_t_min }.

Figure 6.23. Fd set Σ_1 of hypothesis υ = 1.

Σ_2 = { K b t x_t_min υ → x,
  φ → K b t_delta t_max t_min x_t_min }.

Figure 6.24. Fd set Σ_2 of hypothesis υ = 2.
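
To make such fd sets concrete, here is a small Python sketch that stores the fd set of hypothesis υ = 2 (Fig. 6.24) as a list of (lhs, rhs) pairs and walks the dependencies backwards from a predicted variable to the attributes that no fd determines. It only illustrates transitive reasoning over fd's under simplifying assumptions: the function name and the notion of "root attribute" are introduced for this example, and this is not the folding algorithm of the thesis. The strings 'phi' and 'upsilon' stand for φ and υ.

```python
# Fd set of the logistic-equation hypothesis (Fig. 6.24), one (lhs, rhs) pair
# per dependency; 'phi' and 'upsilon' stand for the phenomenon and hypothesis ids.
SIGMA_2 = [
    ({"K", "b", "t", "x_t_min", "upsilon"}, {"x"}),
    ({"phi"}, {"K", "b", "t_delta", "t_max", "t_min", "x_t_min"}),
]


def root_attributes(fds, target):
    """Return the attributes that transitively determine `target` and are
    themselves not on the right-hand side of any fd."""
    determined_by = {}
    for lhs, rhs in fds:
        for attr in rhs:
            determined_by.setdefault(attr, set()).update(lhs)
    roots, seen, stack = set(), set(), [target]
    while stack:
        attr = stack.pop()
        if attr in seen:
            continue  # already visited; this also guards against cyclic fd sets
        seen.add(attr)
        if attr in determined_by:
            stack.extend(determined_by[attr])
        else:
            roots.add(attr)
    return roots


print(sorted(root_attributes(SIGMA_2, "x")))
```

For the predicted variable x this yields phi, t and upsilon: the prediction is traced back to the phenomenon, the hypothesis and the time instant, with the input parameters K, b and x_t_min absorbed along the way.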


Symbol Mappings. We consider that the user provides the following symbol
mappings for (resp.) phenomena φ = 1 and φ = 2; see the interface for mapping
symbols in Fig. 6.21(c).
Σ_3 = { b p t y υ → x,
  d r t x υ → y,
  φ → b d p r t_delta t_max t_min x_t_min y_t_min }.

Figure 6.25. Fd set Σ_3 of hypothesis υ = 3.

• M_{1↦1} = { t ↦ Year, x ↦ Population };

• M_{2↦1} = { t ↦ Year, x ↦ Population };

• M_{1↦2} = { t ↦ Year, x ↦ Lynx };

• M_{2↦2} = { t ↦ Year, x ↦ Lynx };

• M_{3↦2} = { t ↦ Year, x ↦ Lynx };

Hypothesis Management. Query Q4 illustrates the feature of hypothesis management for this case. The user selects hypothesis υ = 3 (the Lotka-Volterra model), and filters its available data for trial tid = 6 on phenomenon φ = 2. Both the form-based query set-up and its result set are shown in Fig. 6.21(d).

Q4. select "t", "y", "x" from Y3_claim1
    where upsilon=3 and phi=2 and tid=6 order by "t";


Hypothesis Analytics. Fig. 6.26 and Fig. 6.27 show the results of analytics on (resp.) phenomena φ = 1 and φ = 2 after conditioning the probability distribution in the presence of (resp.) observational datasets ‘US-census’ and ‘Lynx-Hare.’ In the first one, the user verifies that hypothesis υ = 1 (the Malthusian model) is unlikely to be competitive with hypothesis υ = 2 (the Logistic equation) as an approximation of the US population dynamics from 1790 to 1990. That is, if the user knows her current trials are reasonable, then more trials on the Malthusian model could hardly outperform trials on the Logistic equation for the studied phenomenon.
PH1 CONF
φ  υ  tid  Year  Population        Prior  Posterior
1  1  1    1920  194.102222534948  .250   .000000
1  1  2    1920  140.244165184248  .250   .000000
1  2  1    1920  82.3031951115155  .125   .133038
1  2  2    1920  108.251924734215  .125   .239684
1  2  3    1920  105.918217777077  .125   .290026
1  2  4    1920  105.988231944275  .125   .337251



Figure 6.26. Results of analytics on the US population phenomenon. 


PH2 CONF
φ  υ  tid  Year  Lynx   Prior  Posterior
2  1  1    1904  16.49  .167   .047
2  1  2    1904  18.22  .167   .000
2  2  1    1904  79.81  .167   .013
2  2  2    1904  77.82  .167   .017
2  3  1    1904  89.59  .055   .131
2  3  2    1904  65.06  .055   .184
2  3  3    1904  90.08  .055   .124
2  3  4    1904  77.46  .055   .176
2  3  5    1904  88.32  .055   .127
2  3  6    1904  75.92  .055   .180

Figure 6.27. Results of analytics on the Hudson’s Bay lynx population phenomenon; see interfaces in Figs. 6.21(e, f).


6.4. Experiments 

The efficiency and scalability of the U-relational representation system and 
its p-WSA query algebra have been extensively demonstrated [27]. T-DB’s, as 
U-relational hypothesis DB’s, must therefore be as efficient and scalable as any 
arbitrary U-relational DB. 


In these experiments (see Fig. 6.28) we provide some measures of performance of the T-DB method in the particular context of our real-world Physiome testbed. Our purpose here is to give a concrete feel for how efficient the T-DB methodology can be. However, most of these tests (the four graphs on the bottom in Fig. 6.28) involve the data level and thus demand more of the hardware. Our current experimental setup (a personal computer) allows us to reach a scale in which the uncertain data being processed in synthesis ‘4U’ is sized up to 1 GB. (These experiments were performed on a 2.3 GHz Intel Core i5 with 4 GB of memory running Mac OS X 10.6.8 and MayBMS, a PostgreSQL 8.3.3 extension.)

Figure 6.28. Performance behavior of T-DB on a Physiome testbed scenario.

For the first two graphs (XML extraction and encoding), we have collected the response time on the measure of interest over different structure lengths. Each one corresponds to a real Physiome hypothesis from the table of Fig. 6.29. The last hypothesis in that table, υ = 379, is used for the tests of the last four graphs in Fig. 6.28, viz., u-learning, u-factorization, u-propagation and conditioning. We have set different numbers of trials (ntrials) over it, each one having 1 MB. The last test in each of these four graphs, with 1K trials, processes 1 GB of uncertain data at once and still fits the machine’s main memory. We interpret the performance results shown in these graphs as follows for each measure of performance.


HYPOTHESIS
υ    name                          |S|   |E|
186  Regulatory_Vessel             40    20
89   Myo_Dyn_Resp_wFit             73    28
60   Myogenic_Compliant_Vessel     100   38
75   Baroreceptor_Lu_et_al_2001    153   74
70   4-State_Sarcomere_Energetics  298   91
120  Comp_four_gen_weibel_lung     440   186
91   Cardiopulmonary_Mechanics     1132  412
93   CardiopulmonMechGasBloodExch  1593  525
153  HighlyIntegHuman              1624  538
154  HighlyIntHuman_wIntervention  1919  634
379  Baroreflex_SB_CT              171   74

Figure 6.29. Physiome hypotheses used in the experiments.

• Extraction. Some fluctuation may be due to practicalities of XML DOM access methods. The point of this performance study is to have practical measures of the amount of time taken to process representative hypothesis structures. Note that even for structures of size |S| = 2K the amount of time required to extract a hypothesis is kept at a subsecond order of magnitude (interactive response time) on a personal machine.


• Encoding. Some fluctuation is expected due to varying degrees of coupling between variables in the hypothesis structures. Note that, although |S| provides a very good measure of their size and complexity, the extent to which they are intricate should have an impact on the encoding procedure, which is in any case kept within O(√|E| · |S|). Again, the point here is to provide a notion of the amount of time required to encode representative hypotheses. For a scalability test on the encoding procedure, cf. Fig. 3.8.


• U-intro. The U-intro procedure is composed of u-factor learning, u-factorization and u-propagation. We observed in previous tests that it was dominated by the learning component, viz., the discovery of occasional fd’s in the ‘big’ fact table. However, this is no longer the case once we implemented the workaround of keeping (in addition to the ‘big’ table) a table containing only the exogenous (input parameter) variables, as it has negligible size w.r.t. the data of the endogenous (predictive output) variables. Then u-learning became subsecond again and the U-intro procedure became dominated by u-propagation. In fact, the procedure of u-factorization, carried out once the fd’s are discovered by u-learning, is also subsecond and thus has negligible processing time w.r.t. u-propagation. The latter is the most expensive sub-procedure of the synthesis method.


• Conditioning. The conditioning procedure is run for a selected phenomenon. It is composed of four main parts. First, by operation conf() it performs a probabilistic inference sub-query on the proper predictive projection of the ‘big’ fact table of each hypothesis associated with the phenomenon. Second, it combines the results of each such sub-query through a union all query whose result set is a multi-hypothesis predictive table. Third, it loads the phenomenon observation sample data and the predictive data from the multi-hypothesis table into memory to apply Bayesian inference. Finally, the prior probability distribution of the predictive table is updated with the posterior and all the corresponding marginal probabilities are updated in their original U-relational tables. In our tests, this procedure is carried out over a varying number of trials (ntrials). The total response times are shown in the last plot of Fig. 6.28.
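
The third and fourth parts of the conditioning procedure can be pictured with a short Python sketch. The text here does not spell out the likelihood model, so the sketch assumes independent Gaussian observation errors with a user-chosen standard deviation `sigma`. That choice, the function name and the toy numbers are all assumptions made for illustration; they are not the system's actual implementation, and the output is not meant to reproduce any figure of this chapter.

```python
import math


def condition(prior, predictions, observations, sigma=1.0):
    """Update trial probabilities given observations.

    prior:        {trial key: prior probability (must be > 0)}
    predictions:  {trial key: list of predicted values}
    observations: list of observed values, aligned with the predictions
    sigma:        assumed standard deviation of the observation error
    """
    log_post = {}
    for key, pred in predictions.items():
        sq_err = sum((p - o) ** 2 for p, o in zip(pred, observations))
        # log prior + Gaussian log-likelihood (constant terms cancel out)
        log_post[key] = math.log(prior[key]) - sq_err / (2.0 * sigma ** 2)
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    weights = {k: math.exp(v - top) for k, v in log_post.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Toy data: keys are (hypothesis, trial) pairs; all values are made up.
prior = {(60, 1): 0.5, (89, 1): 0.25, (89, 2): 0.25}
predictions = {
    (60, 1): [195.0, 195.7],
    (89, 1): [97.1, 97.1],
    (89, 2): [116.3, 116.3],
}
observations = [115.0, 117.0]
posterior = condition(prior, predictions, observations, sigma=10.0)
ranking = sorted(posterior, key=posterior.get, reverse=True)
print(ranking)
```

Working in log space avoids underflow when a trial is far from the observations, which is the typical situation for all but a few trials in a large sweep.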


This performance behavior is to be interpreted in the context of ETL in DW’s. Loading and setting up a T-DB has an overhead that should, though, be much lower on high-performance machines. Such overhead is nonetheless justified for the use case of hypothesis management and analytics as opposed to simulation data management and exploratory analytics (cf. §2.6.2).


6.5. Discussion 


We have verified that the hypothesis ratings/rankings shown in §6.2 coincide
with the results (e.g., of model tuning) described in the Physiome model entries and
their related publications. That validates the applicability of the T-DB methodology
as a tool for data-driven analysis in such realistic scenarios.

The current practice in Computational Science for model evaluation and comparison in the presence of observational data is somewhat handcrafted: model agreement is assessed either qualitatively by referring to curve shapes in data plots or quantitatively by means of ad-hoc scripts. The T-DB methodology offers a tool to perform data-driven hypothesis analytics semi-automatically directly in the database under the support of its querying capabilities. It has, therefore, potential to be a step towards higher standards of reproducibility and scalability.

Realistic assumptions. The core assumption of our framework is that the hypotheses are given in a formal specification which is encodable into a SEM that is complete (i.e., it satisfies the corresponding definitions given earlier in this thesis). Also, as a semantic assumption which is standard in scientific modeling, we consider a one-to-one correspondence between real-world entities and variable/attribute symbols within a structure, and that all of them must appear in some of its equations/fd’s. For most science use cases involving deterministic models (if not all), such assumptions are quite reasonable. It can be a topic of future work to explore business use cases as well.

Hypothesis learning. The (user) method for hypothesis formation is irrelevant to our framework, as long as the resulting hypothesis is encodable into a SEM. So, a promising use case is to incorporate machine learning methods into our framework to scale up the formation/extraction of hypotheses and evaluate them under the querying capabilities of a p-DB. Consider, e.g., learning the equations with a tool such as Eureqa.

Qualitative hypotheses. The T-DB methodology is primarily motivated by computational science (usually involving differential equations). It is, however, applicable to qualitative deterministic models as well. Boolean Networks, e.g., consist of sets of functions f(x_1, x_2, ..., x_n), where f is a Boolean expression. For instance, Fig. 6.30 presents the system of Boolean equations of a tentative Boolean Network model for a plant hormone (Fig. 6.31) published in [78]. The notation, e.g., SphK* = ABA, is read (just like an ordinary differential equation) as ‘the next state value of variable SphK is given by the state value of variable ABA.’ The parameters in this kind of model are the variable initial conditions.
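
As a rough illustration of how such a model produces predictions, the sketch below iterates a fragment of the equations of Fig. 6.30 (the chain from ABA to ROP2) with the input nodes ABA, GCR1 and AGB1 held fixed. The choice of fragment, the synchronous update scheme (all nodes updated at once from the previous state) and the initial state are assumptions made for this example only.

```python
# Next-state rules for a fragment of the Boolean network (assumed synchronous).
RULES = {
    "SphK": lambda s: s["ABA"],
    "S1P": lambda s: s["SphK"],
    "GPA1": lambda s: (s["S1P"] or not s["GCR1"]) and s["AGB1"],
    "PLD": lambda s: s["GPA1"],
    "PA": lambda s: s["PLD"],
    "ROP2": lambda s: s["PA"],
}


def step(state):
    """Apply every rule to the current state at once (synchronous update)."""
    nxt = dict(state)  # input nodes without a rule keep their value
    for node, rule in RULES.items():
        nxt[node] = rule(state)
    return nxt


def simulate(state, n_steps):
    """Return the list of states visited, starting from `state`."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory


# Inputs ABA, GCR1 and AGB1 are on; all other nodes start off.
initial = {"ABA": True, "GCR1": True, "AGB1": True,
           "SphK": False, "S1P": False, "GPA1": False,
           "PLD": False, "PA": False, "ROP2": False}
trajectory = simulate(initial, 6)
for t, s in enumerate(trajectory):
    print(t, [n for n in RULES if s[n]])
```

With these inputs the signal travels one node per step along the chain, so ROP2 switches on only at the sixth step. A different initial condition is a different trial of the same hypothesis.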


http://creativemachines.cornell.edu/Eureqa

Cf. http://atlas.bx.psu.edu/booleannet/booleannet.html
H = { SphK* = ABA,
  S1P* = SphK,
  GPA1* = (S1P or not GCR1) and AGB1,
  PLD* = GPA1,
  PA* = PLD,
  pHc* = ABA,
  OST1* = ABA,
  ROP2* = PA,
  Atrboh* = pHc and OST1 and ROP2 and not ABI1,
  ROS* = Atrboh,
  H+ATPase* = not ROS and not pHc and not Ca2+c,
  ABI1* = pHc and not PA and not ROS,
  RCN1* = ABA,
  NIA12* = RCN1,
  NOS* = Ca2+c,
  NO* = NIA12 and NOS,
  GC* = NO,
  ADPRc* = NO,
  cADPR* = ADPRc,
  cGMP* = GC,
  PLC* = ABA and Ca2+c,
  InsP3* = PLC,
  InsPK* = ABA,
  InsP6* = InsPK,
  CIS* = (cGMP and cADPR) or (InsP3 and InsP6),
  Ca2+ATPase* = Ca2+c,
  Ca2+c* = (CaIM or CIS) and (not Ca2+ATPase),
  AnionEM* = ((Ca2+c or pHc) and not ABI1) or (Ca2+c and pHc),
  Depolar* = KEV or AnionEM or (not H+ATPase) or (not KOUT) or Ca2+c,
  CaIM* = (ROS or not ERA1 or not ABH1) and (not Depolar),
  KOUT* = (pHc or not ROS or not NO) and Depolar,
  KAP* = (not pHc or not Ca2+c) and Depolar,
  KEV* = Ca2+c,
  PEPC* = not ABA,
  Malate* = PEPC and not ABA and not AnionEM,
  RAC1* = not ABA and not ABI1,
  Actin* = Ca2+c or not RAC1,
  Closure* = (KOUT or KAP) and AnionEM and Actin and not Malate }.


Figure 6.30. Example of Boolean Network hypothesis. 







Figure 6.31. Example of Boolean Network model (source: [78]).


Several kinds of dynamical systems can be modeled in this formalism. Applications have grown from gene regulatory networks to social networks and stock market predictive analytics. Even if richer semantics is considered (e.g., fuzzy logics), our encoding method is applicable likewise, as long as the equations are still deterministic.

6.6. Conclusions 

In this chapter we have demonstrated and discussed the applicability of the T-DB methodology. We have referred to real-world use case scenarios derived from the Physiome research project. We have shown in some detail the process of building a T-DB with representative models from Physiome’s model repository. That qualitative assessment is followed by experiments that provide some concrete feel for the performance behavior of T-DB for models with up to 600+ mathematical variables.




























































Chapter 7 


Conclusions 


In this chapter we (§7.1) revisit the research questions addressed by this thesis, (§7.2) point out its significance and limitations, (§7.3) list open problems and topics for future work, and (§7.4) conclude with final considerations.


7.1. Revisiting the Research Questions 

Let us now revisit the conceptual (RQ1-4) and technical (RQ5-9) research
questions.

RQ1. How to define and encode hypotheses ‘as data’? What are the sources of
uncertainty that may be present and should be considered? 

In Chapter we have provided core abstractions that compose the vision 
of hypotheses ‘as data,’ or the T-DB vision. The problem of hypothesis 
encoding has been defined and addressed further in Chapter We have 
distinguished two main sources of uncertainty in our model of uncertainty 
for hypothesis management, viz., (i) theoretical uncertainty, as arising from 
competing hypotheses; and (ii) empirical uncertainty, as arising from the 
alternative trial datasets available for each hypothesis. 

RQ2. How do hypotheses ‘as data’ relate to observational data or, likewise, phenomena ‘as data’ from a database perspective?

Also in that chapter, we have presented a conceptual framework in which we have defined hypotheses ‘as data’ and shown how they can be compared against phenomena ‘as data.’ In fact, hypothesis management is really significant when it is possible to rate/rank hypotheses in the presence of (some partial piece of) evidence.


RQ3. Does every piece of simulated data qualify as a scientific hypothesis? What is the difference between managing ‘simulation’ data and managing ‘hypotheses’ as data?

Early on, in Table 1.1, we provided a comparison between simulation data management and hypothesis data management. Furthermore, the scientific research process is abstracted in an earlier chapter as a well-defined problem of data cleaning. Hypotheses are seen from an applied science point of view, and are then reduced to data such that a piece of simulation data is considered a hypothesis whenever it is assigned to explain some specific phenomenon.


RQ4. Is there available a proper (machine-readable) data format from which we can automatically extract mathematically-expressed hypotheses?

We anticipated in an earlier chapter the adoption of the XML data model as the general data format from which to extract hypothesis specifications. In particular, since we deal here with mathematical hypotheses, we refer to MathML as a standard for hypothesis specification. Concretely, in Chapter 6 we present use case demonstration scenarios for which we have developed a specific wrapper, viz., for the extraction of hypotheses specified in MML (Mathematical Modeling Language).

RQ5. Is there an algorithm to, given a SEM, efficiently extract its causal order¬ 
ing? What are the computational properties of this problem? 

As shown in Chapter , Simon’s treatment of the problem of causal 
ordering given a SEM $S(\mathcal{E}, \mathcal{V})$ is NP-Hard. In the same chapter, we have 
discussed this problem in detail and presented an effective, efficient 
algorithmic approach to the problem. The computational cost for the whole 
process of hypothesis encoding is bounded by $O(\sqrt{|S|}\,|\mathcal{E}|)$. Experiments 
show that the approach performs well in practice for large hypotheses. 
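
To make the notion of a total causal mapping concrete, here is a minimal sketch that assigns each equation to one of its variables by bipartite matching and then lists the direct causal dependencies that the assignment induces. It uses simple augmenting paths rather than the algorithm analysed in the thesis, so it does not attain the bound quoted above; the names `total_causal_mapping`, `direct_dependencies` and the example structure are assumptions made for illustration.

```python
def total_causal_mapping(structure):
    """structure: dict equation -> set of variables, with as many equations as variables.

    Returns a dict equation -> variable (a perfect matching), or None if none exists.
    Plain augmenting-path matching, chosen for brevity rather than speed.
    """
    match_var = {}  # variable -> equation currently assigned to it

    def try_assign(eq, seen):
        for var in sorted(structure[eq]):
            if var in seen:
                continue
            seen.add(var)
            # Take a free variable, or re-route the equation that holds it.
            if var not in match_var or try_assign(match_var[var], seen):
                match_var[var] = eq
                return True
        return False

    for eq in structure:
        if not try_assign(eq, set()):
            return None
    return {eq: var for var, eq in match_var.items()}


def direct_dependencies(structure, mapping):
    """Pairs (x', x): x' appears in the equation that is mapped to x."""
    return {(v, mapping[f]) for f, vs in structure.items() for v in vs if v != mapping[f]}


# Hypothetical 3-equation structure: f1 mentions x1 only, f2 mentions x1 and x2, and so on.
STRUCTURE = {"f1": {"x1"}, "f2": {"x1", "x2"}, "f3": {"x1", "x2", "x3"}}
mapping = total_causal_mapping(STRUCTURE)
direct = direct_dependencies(STRUCTURE, mapping)
print(mapping)
print(sorted(direct))
```

The causal ordering is then the transitive closure of `direct`; for this triangular structure the closure adds nothing new.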
RQ6. What is the connection between SEM’s and fd’s? Can we devise an 
encoding scheme to ‘orient equations’ and then effectively transform one into 
the other with guarantees? Once we do it, what design-theoretic properties 
does such a set of fd’s have? 

Also in Chapter we have presented an algorithmic encoding scheme to 
transform a SEM into a set of fd’s with guarantees in terms of preserving 
the hypothesis causal structure. Our study of this problem has revealed 
some interesting properties of the resulting fd sets, in particular, that they 
are always ‘non-redundant’ and, in comparison with arbitrary information 
systems, more precise and economical in the sense that, for any given 
attribute, there is exactly one fd with it in its rhs. 

RQ7. Is such an fd set ready to be used for p-DB schema synthesis as an encoding 
of the hypothesis causal structure? If not, what kind of further processing 
do we have to do? Can we perform it efficiently by reasoning directly on the 
fd’s? How does it relate to the SEM’s causal ordering? 

As we discuss in Chapter , the encoded fd set must be further processed 
to find the ‘first causes’ for each of its predictive variables. For addressing 
that, in Chapter we have presented the concept of the folding of an fd 
set and an efficient algorithm to compute it. Also, we have shown the 
equivalence of such fd reasoning with causal ordering processing. 

RQ8. Is the uncertainty decomposition required for predictive analytics reducible 
to the structure level (fd processing), or do we need to process the simulated 
data to identify additional uncertainty factors? Finally, what properties are 
desirable for a p-DB schema targeted at hypothesis management? Are they 
ensured by this synthesis method? 

In Chapter we have presented a conceptual framework to address 
synthesis for uncertainty ‘4U.’ In particular, we have introduced the need 
to process, for each hypothesis, its available trial datasets, and presented 
an efficient algorithm to factorize and propagate the overall uncertainty 
present in a given hypothesis (as a competing explanation for a target 
phenomenon). Then we have motivated BCNF as a notion of “good” 
design w.r.t. the factorized fd set based on the folding concept, and the 
lossless join property as required for the correctness of uncertainty 
decomposition. We have shown that the synthesized p-DB schema bears both 
properties. 

RQ9. Given all such design-theoretic machinery to process hypotheses into 
(U-)relational DB’s, what properties can we detect on the hypotheses back 
at the conceptual level? Do we now have technical means to speak of 
hypotheses that are “good” in terms of principles of the philosophy of science? 

Equipped with the design-theoretic machinery proposed in this thesis, we 
are able, given a SEM, to automatically (1) extract its causal ordering, 
(2) detect its strongly coupled components and decide, for a given 
predictive projection, what its associated u-factor projections are (if any), and 
we shall be able as well to (3) query the hypothesis ranking for a phenomenon 
of interest. All these are technical means to [15]: (1') extract the 
hypothesis ‘empirical content’ and ‘predictive power;’ (2') unravel its cohesiveness 
and how parsimonious it is in terms of the number of different claims or 
epistemological units carried within it, as well as its empirical grounding 
(‘first causes’); and finally, we shall be able to (3') appraise it in face of 
competing or alternative explanations. 

7.2. Significance and Limitations 

This thesis addresses the pressing call for large-scale, data-driven 
hypothesis management and analytics [35l [H] [TH]. Some reasons that contribute to its 
significance are listed below (cf. [TOl 1^ 1^). 

• Structured deterministic hypotheses are now shown to be encodable as 
uncertain and probabilistic (U-relational) data based on p-DB principles; 

  * Study of the connection between SEM’s and fd’s, with contribution 
    both to computational properties of the causal ordering problem, and 
    to causal reasoning over fd’s; 

  * First synthesis method for the construction of p-DB’s from some 
    previously existing formal specification. 

• Definition of a concrete use case of data-driven hypothesis management 
and analytics; 

  * New class of applications introduced for p-DB’s; 

  * Settled the problem of Bayes’ conditioning in p-DB’s. 

Now some limitations of the thesis are listed. 

• The Bayesian inference is implemented at application level, yet not 
formulated as a principled technical solution within research in p-DB’s. 

• The encoding scheme to transform the mathematical system of a 
hypothesis into a set of fd’s enabling the synthesis of the p-DB is applicable to 
structured deterministic models only, not stochastic ones. 

7.3. Open Problems and Future Work 

Open problems and topics of future work are listed (no particular order). 

(1) The design of a dedicated algebraic operation for Bayes’ conditioning in 
p-WSA. 

(2) Investigation of other data dependency formalisms (e.g., multi-valued 
dependencies izni), approximate fd’s [68], conditional fd’s |7Sl) to extend the 
scope of T-DB towards structured stochastic models. 

(3) Development of techniques for systematic hypothesis extraction as a 
well-defined problem of (web) information extraction; 

(4) Investigation of business use case scenarios for data-driven decision making 
on top of T-DB; 

(5) Definition of a machine learning use case scenario to industrialize 
hypothesis formation and assess T-DB’s performance feasibility in such a scenario; 

(6) Development of automatic data sampling techniques to leverage the data 
definition of both hypotheses and phenomena in T-DB from a statistical 
point of view. 

7.4. Final Considerations 

In this thesis we have developed the vision of T-DB, which is essentially 
an abstraction of hypotheses as uncertain and probabilistic data. It comprises 
a design-theoretic methodology for the systematic construction and management 
of U-relational hypothesis DB’s. It is meant to provide a principled approach 
to enable scientists and engineers to manage and evaluate (rate/rank) large-scale 
scientific hypotheses as theoretical data. We have addressed some core technical 
challenges over the T-DB vision in order to properly encode deterministic 
hypotheses as uncertain and probabilistic data. 

As envisioned by Jim Gray [T], the scientific method has been shifting towards 
being operated as a data-driven discipline which is rapidly gaining ground j^. In 
this thesis we have striven to propose some core principles and techniques for 
enabling data-driven hypothesis management and analytics, opening a promising 
line of research in both probabilistic databases and simulation data management. 

Appendix A 

Detailed Proofs 

A.1. Proofs of Hypothesis Encoding 

A.1.1 Proof of Theorem 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure. Then the extraction of its causal ordering by 
Simon’s $\mathit{COA}(S)$ is intractable (NP-Hard).” 

Proof 17 We show that, at each recursive step of COA, to find all non-trivial 
minimal subsets (i.e., $|\mathcal{E}'| \ge 2$) translates into an optimization problem associated 
with the decision problem BPBP, which we know by Lemma to be NP-Complete. 

First, recall (Def. ) that a structure $S(\mathcal{E}, \mathcal{V})$ is complete if $|\mathcal{E}| = |\mathcal{V}|$; e.g., for 
the structure given in Fig. (left), note (Def. ) the minimal structure $S'(\mathcal{E}', \mathcal{V}')$, 
where $\mathcal{E}' = \{ f_1, f_2, f_3 \}$. For non-trivial minimal structures, i.e., when $|\mathcal{E}'| = K \ge 2$, 
it is easy to see that its corresponding bipartite graph $G = (V_1' \cup V_2', E')$, where 
$V_1' \mapsto \mathcal{E}'$, $V_2' \mapsto \mathcal{V}'$ and $E' \mapsto S'$, must have number of edges $|E'| \ge 2K$ and, for all 
its vertices $u \in V_1' \cup V_2'$, $u$ must have $\deg(u) \ge 2$, i.e., $G$ is a pseudo-biclique in 
accordance with Def. . That intuition is elaborated as follows. 

The point is that, no matter how big such a structure $S'$ is, its equations $f \in \mathcal{E}'$ 
are such that $|\mathit{Vars}(f)| \ge 2$ (as $S'$ is non-trivial) and its variables can be grouped 
in local patterns from the sparsest kind to the densest. To construct an instance of 
the sparsest case, let $S'$ be built by setting a first equation where its entry in the 
structure matrix has form $(1, 1, 0^{K-2})$ and then, for the next $|\mathcal{E}'| - 2$ equations, 
shift such pair of 1’s one position right w.r.t. the previous one. Then complete it 
with a last equation whose form is $(1, 0^{K-2}, 1)$. That is, the structure is built 
with unique pairs of 1’s spread all over the structure. Then, deciding whether there 
is a minimal structure of size $|\mathcal{E}'| = K$ corresponds exactly to BPBP. It is a special 
case (BBP) when such minimal structure is the densest possible, i.e., when its 
structure matrix has only 1’s and then its corresponding bipartite graph is a $K$-balanced 
biclique with $|E'| = K^2$ and $\deg(u) = K$ for all vertices $u \in V_1' \cup V_2'$. For instance, see the 
minimal structure with $\mathcal{E}' = \{ f_4, f_5 \}$ found at the second recursive step of COA 
in Fig. 3.2. □ 
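
The two extreme patterns used in the proof can be generated and checked mechanically. The sketch below is only an illustration under the assumption that the sparsest construction is the "ring" of shifted pairs of 1's described above; the function names are hypothetical.

```python
def sparsest_structure(k):
    """K x K structure matrix made of shifted pairs of 1's, closed by a last row (k >= 2)."""
    rows = []
    for i in range(k - 1):          # first equation plus the next k-2 shifted ones
        row = [0] * k
        row[i] = row[i + 1] = 1
        rows.append(row)
    last = [0] * k                  # closing equation: 1 in the first and last column
    last[0] = last[k - 1] = 1
    rows.append(last)
    return rows


def densest_structure(k):
    """K x K structure matrix with only 1's (a K-balanced biclique)."""
    return [[1] * k for _ in range(k)]


def edges_and_degrees(matrix):
    """Number of edges, equation degrees (rows) and variable degrees (columns)."""
    n_edges = sum(map(sum, matrix))
    row_deg = [sum(r) for r in matrix]
    col_deg = [sum(c) for c in zip(*matrix)]
    return n_edges, row_deg, col_deg


print(edges_and_degrees(sparsest_structure(4)))
print(edges_and_degrees(densest_structure(4)))
```

For K = 4 the sparse pattern has 2K = 8 edges with every degree equal to 2, while the dense one has K² = 16 edges with every degree equal to K, matching the two bounds stated in the proof.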


A.1.2 Proof of Proposition 

“Let $S(\mathcal{E}, \mathcal{V})$ be a structure, and $\varphi_1 : \mathcal{E} \to \mathcal{V}$ and $\varphi_2 : \mathcal{E} \to \mathcal{V}$ be any two total 
causal mappings over $S$. Then $C_1^+ = C_2^+$.” 


Proof 18 The proof is based on an argument from Nayak [19], which we present 
here arguably much clearer. Intuitively, it shows that if $\varphi_1$ and $\varphi_2$ differ on the 
variable an equation $f$ is mapped to, then such variables, viz., $\varphi_1(f)$ and $\varphi_2(f)$, 
must be causally dependent on each other (strongly coupled). 

To show $C_1^+ = C_2^+$ reduces to $C_1^+ \subseteq C_2^+$ and $C_2^+ \subseteq C_1^+$. We show the first 
containment, with the second being understood as following by symmetry. Closure 
operators are extensive, $X \subseteq cl(X)$, and idempotent, $cl(cl(X)) = cl(X)$. That is, if 
we have $C_1 \subseteq C_2^+$, then we shall have $C_1^+ \subseteq (C_2^+)^+$ and, by idempotence, $C_1^+ \subseteq C_2^+$. 

Then it suffices to show that $C_1 \subseteq C_2^+$, i.e., for any $(x', x) \in C_1$, we must show 
that $(x', x) \in C_2^+$ as well. Observe by Def. that both $\varphi_1$ and $\varphi_2$ are bijections, 
then, invertible functions. If $\varphi_1^{-1}(x) = \varphi_2^{-1}(x)$, then we have $(x', x) \in C_2$ and thus, 
trivially, $(x', x) \in C_2^+$. Else, $\varphi_1$ and $\varphi_2$ disagree in which equations they map onto 
$x$. But we show next, in any case, that we shall have $(x', x) \in C_2^+$. 

Take all equations $g \in \mathcal{E}' \subseteq \mathcal{E}$ such that $\varphi_1(g) \neq \varphi_2(g)$, and let $n \le |\mathcal{E}|$ be 
the number of such ‘disagreed’ equations. Now, let $f \in \mathcal{E}'$ be such that its mapped 
variable is $x = \varphi_1(f)$. Construct a sequence of length $2n$ such that $s_0 = \varphi_1(f) = x$ 
and, for $1 \le i \le 2n$, element $s_i$ is defined $s_i = \varphi_2(\varphi_1^{-1}(s_{i-1}))$. That is, we are 
defining the sequence such that, for each equation $g \in \mathcal{E}'$, its disagreed mappings 
$\varphi_1(g) = x_a$ and $\varphi_2(g) = x_b$ are such that $\varphi_1(g)$ is immediately followed by $\varphi_2(g)$. 
As $x_a, x_b \in \mathit{Vars}(g)$, we have $(x_a, x_b) \in C_2$ and, symmetrically, $(x_b, x_a) \in C_1$. 




The sequence is of form $s = \langle \underbrace{x, x_f}_{f}, \ldots, \underbrace{x_a, x_b}_{g}, \ldots, \underbrace{x_{2n-1}, x_{2n}}_{h} \rangle$. 

Since $x$ must be in the codomain of $\varphi_2$, we must have a repetition of $x$ at 
some point $2 \le k \le 2n$ in the sequence index, with $s_k = x$ and $s_{k-1} = x''$ such that 
$(x'', x) \in C_2$. If $x'' = x'$, then $(x', x) \in C_2$ and obviously $(x', x) \in C_2^+$. Else, note 
that $x_f$ must also be in the codomain of $\varphi_1$, while $x''$ in the codomain of $\varphi_2$. Let $i$ 
be the point in the sequence, $3 \le i \le 2n - 1$, at which $s_i = x_f = x_a$ and $s_{i+1} = x_b$ 
for some $x_b$ such that $(x_f, x_b) \in C_2$. It is easy to see that either we have $x_b = x''$ 
or $x_b \neq x''$ but $(x_b, x'') \in C_2^+$. Thus, by transitivity on such a causal chain, we 
must have $(x_f, x'') \in C_2^+$ and eventually $(x_f, x) \in C_2^+$. Finally, since $x' \in \mathit{Vars}(f)$ 
and $\varphi_2(f) = x_f$, we have $(x', x_f) \in C_2$ and, by transitivity, $(x', x) \in C_2^+$. □ 
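
The statement can be checked by brute force on a small structure: enumerate every total causal mapping and compare the transitive closures of the dependencies they induce. This is a sanity check only, not part of the proof; the example structure and all function names are assumptions introduced here.

```python
from itertools import permutations


def total_causal_mappings(structure):
    """Yield every bijection equation -> variable that respects Vars(f)."""
    eqs = sorted(structure)
    variables = sorted(set().union(*structure.values()))
    for perm in permutations(variables, len(eqs)):
        if all(v in structure[f] for f, v in zip(eqs, perm)):
            yield dict(zip(eqs, perm))


def transitive_closure(pairs):
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


def induced_closure(structure, mapping):
    """Closure of the pairs (x', x) with x' in Vars(f) and x the variable f is mapped to."""
    direct = {(v, mapping[f]) for f, vs in structure.items() for v in vs if v != mapping[f]}
    return transitive_closure(direct)


# Hypothetical structure in which x2 and x3 are strongly coupled through f2 and f3.
STRUCTURE = {"f1": {"x1"}, "f2": {"x1", "x2", "x3"}, "f3": {"x2", "x3"}}
mappings = list(total_causal_mappings(STRUCTURE))
closures = [induced_closure(STRUCTURE, m) for m in mappings]
print(len(mappings), closures[0] == closures[1])
```

The two mappings differ on where f2 and f3 are sent, yet both closures contain the same pairs, including the mutual dependency between x2 and x3.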

A.1.3 Proof of Theorem 

“Let $\Sigma$ be an fd set defined $\Sigma$ = h-encode($S$) for some complete structure $S$. Then 
$\Sigma$ is non-redundant and singleton-rhs but may not be left-reduced (then may not be 
canonical).” 

Proof 19 We will show that properties (a-b) of Def. hold for $\Sigma$ produced by 
(Alg. ) h-encode, but property (c) may not hold. 

At initialization, the algorithm sets $\Sigma = \emptyset$ and then inserts an fd $\langle X, A \rangle \in \Sigma$ 
for each $\langle f, x \rangle \in \varphi_t$ scanned, where $x \mapsto A$ and $X \cap A = \emptyset$. At termination, for 
all fd’s in $\Sigma$ we obviously have $|A| = 1$; then property (a) holds. Also, note that 
$\varphi_t : \mathcal{E} \to \mathit{Vars}(S)$ is, by Def. , a bijection. 

Now, for property (b) not to hold there must be some fd $\langle X, A \rangle \in \Sigma$ that 
is redundant and then can be found in the closure of $\Gamma = \Sigma \setminus \langle X, A \rangle$. By Lemma 
4 (below), that can be the case only if $A \subseteq X$ or there is $\langle Y, A \rangle \in \Gamma$ for some $Y$. 
But from $X \cap A = \emptyset$, we have $A \not\subseteq X$; and from $\varphi_t$ being a bijection it follows that 
there can be no such fd in $\Gamma$. Thus it must be the case that $\Sigma$ is non-redundant, 
i.e., property (b) holds. 

Finally, property (c) does not hold if there can be some fd $\langle X, A \rangle \in \Sigma$ with 
$Y \subset X$ such that $\Gamma = \Sigma \setminus \langle X, A \rangle \cup \langle Y, A \rangle$ has the same closure as $\Sigma$. That is, if 
we may find $\langle Y, A \rangle \in \Sigma^+$. Now, pick structure $S$ whose $(3 \times 3)$ matrix $A_S$ has rows 
$(1, 0, 0)$, $(1, 1, 0)$, $(1, 1, 1)$ as an instance. Alg. encodes it into $\Sigma = \{\phi \to x_1,\; x_1 \upsilon \to x_2,\; 
x_1 x_2 \upsilon \to x_3\}$. Let $Y = \{x_1, \upsilon\}$, and $B = \{x_2\}$. Note that $x_1 \upsilon \to x_2 \in \Sigma$ can be 
written as $\langle Y, B \rangle \in \Sigma$, and $x_1 x_2 \upsilon \to x_3 \in \Sigma$ as $\langle YB, A \rangle \in \Sigma$. Now observe that 
$\langle Y, A \rangle \in \Sigma^+$ can be derived by R5 over $\langle Y, B \rangle, \langle YB, A \rangle \in \Sigma$, which is sufficient to 
show that property (c) may not hold. That is, $B$ is “extraneous” in $\langle YB, A \rangle \in \Sigma$ 
and can be removed from its lhs without loss of information to $\Sigma$. □ 
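
The counter-example can be replayed in a few lines. The sketch below is a simplified stand-in for h-encode, not the thesis's algorithm: it keeps only the variables of each equation and omits the extra identifier attributes that appear in the proof's fd's. The names `h_encode` and `attribute_closure` and the dictionary layout are assumptions for illustration.

```python
def h_encode(structure, mapping):
    """One fd per equation: (Vars(f) minus the mapped variable) -> mapped variable.

    Returned as a dict rhs -> lhs, so there is exactly one fd per rhs attribute.
    """
    return {mapping[f]: frozenset(structure[f]) - {mapping[f]} for f in structure}


def attribute_closure(fds, attrs):
    """Closure of a set of attributes under fds given as dict rhs -> lhs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for rhs, lhs in fds.items():
            if rhs not in closure and lhs <= closure:
                closure.add(rhs)
                changed = True
    return closure


# The 3 x 3 structure with rows (1,0,0), (1,1,0), (1,1,1) used in the proof.
STRUCTURE = {"f1": {"x1"}, "f2": {"x1", "x2"}, "f3": {"x1", "x2", "x3"}}
MAPPING = {"f1": "x1", "f2": "x2", "f3": "x3"}

fds = h_encode(STRUCTURE, MAPPING)
reduced_lhs = fds["x3"] - {"x2"}      # drop x2 from the lhs of the fd for x3
print(fds)
print("x3" in attribute_closure(fds, reduced_lhs))
```

Because x3 is still derivable after x2 is dropped from the left-hand side, the encoded set is singleton-rhs with one fd per attribute and yet not left-reduced, which is the point of property (c).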

Lemma 4 Let $\Sigma$ be a singleton-rhs (Def. , property (a)) fd set on attributes $U$. Then $X \to A$ 
can only be in $\Sigma^+$, where $XA \subseteq U$, if $A \subseteq X$ or there is non-trivial $\langle Y, A \rangle \in \Sigma$ for 
some $Y \subset U$. 

Proof 20 By Lemma 5 (below), we know that $X \to A \in \Sigma^+$ iff $A \subseteq X^+$. We need 
to prove that if $A \not\subseteq X$ and there is no $Y \to A$ in singleton-rhs $\Sigma$, then $A \not\subseteq X^+$. 
But this is equivalent to showing that (Alg. ) XClosure gives only correct answers for 
$X^+$ w.r.t. $\Sigma$, which is known (cf. theorem from Ullman [20, p. 389]). Note that 
XClosure($\Sigma$, $X$) inserts $A$ in $X^+$ only if $A \subseteq X$ or there is some fd $\langle Y, A \rangle \in \Sigma$. □ 

Lemma 5 Let $\Sigma$ be an fd set. An fd $X \to Y$ is in $\Sigma^+$ iff $Y \subseteq X^+$, where $X^+$ is 
the attribute closure of $X$ w.r.t. $\Sigma$. 

Proof 21 This is from Ullman [20, p. 386]. Let $Y = A_1 \ldots A_n$ and suppose $Y \subseteq X^+$. 
Then for each $i$, we have $A_i \in X^+$ and, by the definition of $X^+$, we must 
have $\langle X, A_i \rangle \in \Sigma^+$. Then it follows by (R4) union that $X \to Y$ is in $\Sigma^+$ as 
well. Conversely, suppose $\langle X, Y \rangle \in \Sigma^+$. Then, by (R3) decomposition we have 
$\langle X, A_i \rangle \in \Sigma^+$ for each $A_i \in Y$, and hence $Y \subseteq X^+$. □ 

A.2. Proofs of Causal Reasoning 

A.2.1 Proof of Lemma 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure, $\varphi$ a total causal mapping over $S$ and $\Sigma$ an 
fd set encoded through $\varphi$ given $S$. If $\langle X, A \rangle \in \Sigma$, then $A^{\curvearrowleft}$, the attribute folding of 
$A$ (w.r.t. $\Sigma$), exists and is unique.” 

Proof 22 The existence of $A^{\curvearrowleft}$ is ensured by the degenerate case where $X = A^{\curvearrowleft}$, 
as $X \to A$ is itself in $\Sigma^{\triangleright}$ by an empty application of R5. If $X \to A$ is in fact folded 
w.r.t. $\Sigma$, then the folding of $A$ exists. Else, it is not folded yet $X \to A$ is non-trivial 
because by Theorem $\Sigma$ is non-redundant. Then, by Def. 11 there must be some 
$Y \subset U$ with $Y \nsupseteq X$ such that $Y \to X$ is in $\Sigma^+$ and $X \not\to Y$. By Def. 10, there 
is a finite application of R5 over fd’s in $\Sigma$ to derive $Y \to X$. Then by R2-R5 
over $X \to A$, we have $Y \to A$. Although there may be many such (intermediate) 
attribute sets $Y \subset U$ along the transitive chaining satisfying the conditions above, 
we claim there is at least one that is a folding of $A$. Suppose not. Then, for all 
such $Y \subset U$, there is some $Y' \subset U$ with $Y' \nsupseteq Y$ such that $Y' \to Y$ and $Y \not\to Y'$, 
leading to an infinite regress. Nonetheless, in so far as cycles are ruled out by force 
of Def. 11, then $\Sigma^+$ must have an infinite number of fd’s. But $\Sigma^+$ is finite, viz., 
bounded by $2^{2|U|}$ (cf. [21, p. 165]). Contradiction. Therefore the folding of $A$ must exist. 

Moreover, observe that $\Sigma$ is encoded through $\varphi$, which is by Def. a bijection. 
Then we have $\langle X, A \rangle \in \Sigma$ for exactly one attribute set $X$. Then, as a straightforward 
follow-up of the rationale that led us to infer the folding existence, note that there 
must be a single chaining $Y^{n} \to \cdots \to Y^{1} \to X \to A$. Again, as cycles are 
ruled out by force of Def. 11 and $\Sigma^+$ is finite, then the folding of $A$ is unique. □ 


A.2.2 Proof of Theorem 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure, and $\Sigma$ an fd set encoded given $S$. Now, let 
$\langle X, A \rangle \in \Sigma$. Then AFolding($\Sigma$, $A$) correctly computes $A^{\curvearrowleft}$, the attribute folding of 
$A$ (w.r.t. $\Sigma$), in time $O(|S|^2)$.” 

Proof 23 For the proof roadmap, note that AFolding is monotone (size of $A^{\star}$ can 
only increase) and terminates precisely when $A^{\star(i+1)} = A^{\star(i)}$, where $A^{\star(i)}$ denotes the 
attributes in $A^{\star}$ at step $i$ of the outer loop. The folding $A^{\curvearrowleft}$ of $A$ at step $i$ is 
$A^{\star(i)} \setminus \Lambda^{(i)}$. We shall prove by induction, given attribute $A$ from fd $X \to A$ in $\Sigma$, 
that $A^{\star} \setminus \Lambda$ returned by AFolding($\Sigma$, $A$) is the unique attribute folding $A^{\curvearrowleft}$ of $A$. 

(Base case). By Theorem , $\Sigma$ is non-redundant with (then) non-trivial 
$\langle X, A \rangle \in \Sigma$ for exactly one attribute set $X$; the algorithm always reaches step 
$i = 1$, which is our base case. Then $X$ is placed in $A^{\star(1)}$ and $A$ in $\Lambda^{(1)}$, and we 
have $A^{\star(1)} = XA$ and $\Lambda^{(1)} = A$. Therefore, $A^{\star(1)} \setminus \Lambda^{(1)} = X$, and in fact we have 
$\langle X, A \rangle \in \Sigma^{\triangleright}$ by an empty application of R5. For it to be specifically in $\Sigma^{\curvearrowleft} \subseteq \Sigma^{\triangleright}$, 
it must be folded w.r.t. set $\Delta$ of consumed fd’s at this step, viz., $\Delta^{(1)} = \{X \to A\}$. 
In fact, as the only fd in $\Delta^{(1)}$, by Def. 11 it must be folded w.r.t. $\Delta^{(1)}$, and we 
have $A^{\curvearrowleft} = X$ at step $i = 1$. 

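(Induction). Now, let $i = k$ for $k > 1$, and assume that 
$\langle A^{\star(k)} \setminus \Lambda^{(k)}, A \rangle \in (\Delta^{(k)})^{\curvearrowleft} \subseteq (\Delta^{(k)})^{\triangleright}$. By Lemma we know that 
$A^{\curvearrowleft} = A^{\star(k)} \setminus \Lambda^{(k)}$ is the unique folding of $A$ at step $i = k$. For the inductive step, 
suppose $Y$ is placed in $A^{\star(k+1)}$ and $B$ in $\Lambda^{(k+1)}$ because $\langle Y, B \rangle \in \Sigma \setminus \Delta^{(k)}$ and $B \in A^{\star(k)}$. 
Since $B \in A^{\star(k)}$ and $B \notin \Lambda^{(k)}$ (it is just about to be consumed into $\Lambda^{(k+1)}$), we can 
write $(A^{\star(k)} \setminus \Lambda^{(k)}) = ZB$ for some $Z \not\ni B$, where $(A^{\star(k)} \setminus \Lambda^{(k)}) \to A$ is assumed in $(\Delta^{(k)})^{\curvearrowleft}$. 
Now, with the application of R5 consuming $Y \to B$ we have 
$\underbrace{(A^{\star(k)} Y B \setminus \Lambda^{(k)} B)}_{ZS} \to A$, 
where $S = Y \setminus A^{\star(k)}$. We claim that $ZS \to A$ is folded w.r.t. $\Delta^{(k+1)}$. 
Suppose not. Then by Def. 11 there must be some $W \nsupseteq ZS$ such that 
$W \to ZS$ is in $(\Delta^{(k+1)})^+$ but $ZS \not\to W$. Since $ZS \neq \emptyset$, there must be some 
$C \in ZS$, i.e., $C \notin \Lambda^{(k+1)}$. Note that, as $W \to ZS$ is in $(\Delta^{(k+1)})^+$, then by (R3) 
decomposition we have $W \to C$ in $(\Delta^{(k+1)})^+$ as well. But by Lemma that can 
only be the case if there is some $\langle T, C \rangle \in \Delta^{(k+1)}$, which means $C$ has been already 
consumed into $\Lambda^{(k+1)}$, though $C \notin \Lambda^{(k+1)}$. Contradiction. 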

Finally, as for the time bound, note that in the worst case exactly one fd $Y \to B$ 
is consumed from $\Sigma$ into $\Delta$ for each step of the outer loop, where $|\Sigma| = |\mathcal{E}|$. That 
is, let $n = |\mathcal{E}|$; then $n$ is decreased stepwise in arithmetic progression such that 
$n + (n-1) + \ldots + 1 = n(n+1)/2$ scans are required overall, i.e., $O(n^2)$. Note also, 
however, that $B$ may be the only symbol read at each such fd scan, but in the worst 
case at most $|U| = |\mathcal{V}|$ symbols are read. Thus our measure $n$ should actually be 
overestimated as $n = |\mathcal{E}|\,|\mathcal{V}| = |S|$. Therefore Alg. is bounded by $O(|S|^2)$. □ 
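
The listing of AFolding is not reproduced in this appendix, so the following is only a sketch reconstructed from the description in the proof: grow a set of reachable attributes, record which attributes have been consumed as the rhs of some fd, and return the difference. It assumes a singleton-rhs, acyclic fd set with at most one fd per rhs attribute; the name `attribute_folding` and the example fd's (with `v` standing for an extra identifier attribute) are hypothetical.

```python
def attribute_folding(fds, a):
    """Sketch of an attribute folding for fds given as dict rhs -> lhs (acyclic)."""
    star = {a}          # attributes reached so far (plays the role of A* in the proof)
    consumed = set()    # attributes already consumed as the rhs of some fd
    changed = True
    while changed:
        changed = False
        for rhs, lhs in fds.items():
            if rhs in star and rhs not in consumed:
                star |= lhs         # bring in the lhs of the consumed fd
                consumed.add(rhs)
                changed = True
    return star - consumed


# Hypothetical fd set: x1 v -> x2 and x1 x2 v -> x3, with no fd for x1 or v.
FDS = {"x2": {"x1", "v"}, "x3": {"x1", "x2", "v"}}
print(sorted(attribute_folding(FDS, "x3")))
print(sorted(attribute_folding(FDS, "x2")))
```

Each pass of the outer loop scans the remaining fd's once, which is where the arithmetic-progression count in the proof comes from.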


A.2.3 Proof of Corollary 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure, and $\Sigma$ an fd set encoded given $S$. Then 
algorithm folding($\Sigma$) correctly computes $\Sigma^{\curvearrowleft}$, the folding of $\Sigma$, in time that is $f(|S|)\,O(|\mathcal{E}|)$, 
where $f(|S|)$ is the time complexity of (Alg. ) AFolding.” 


Proof 24 By Theorem , we know that sub-procedure (Alg. ) AFolding is correct 
and terminates. Then (Alg. ) folding necessarily inserts in $\Sigma^{\curvearrowleft}$ (initialized empty) 
exactly one fd $Z \to A$ for each fd $X \to A$ in $\Sigma$ scanned. Thus, at termination we 
have $|\Sigma^{\curvearrowleft}| = |\Sigma|$. Again, as AFolding is correct, we know $Z$ is the unique folding of 
$A$. Therefore it must be the case that Alg. is correct. Finally, for the time bound, 
the algorithm iterates over each fd in $\Sigma$ without having to read its symbols, and at 
each such step AFolding takes time that is $f(n)$. Thus folding takes $f(|S|)\,O(|\mathcal{E}|)$. 
But we know from Theorem and Remark that $f(|S|) \in \Theta(|S|)$; then it takes 
$O(|S|\,|\mathcal{E}|)$. □ 

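A.2.4 Proof of Proposition 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure, $\varphi$ a total causal mapping over $S$ and $\Sigma$ 
an fd set encoded through $\varphi$ given $S$. Let $\Sigma^{\curvearrowleft}$ be the folding of $\Sigma$; then $\Sigma^{\curvearrowleft}$ is 
parsimonious.” 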

Proof 25 By Lemma we know that, for each fd $\langle X, A \rangle \in \Sigma$, the attribute folding 
$Z$ of $A$ such that $Z \to A$ exists and is unique. That is, for no $Y \neq Z$ we have 
$Y \to A$. Thus $\Sigma^{\curvearrowleft}$ = folding($\Sigma$) automatically satisfies Def. as long as we show 
it is canonical (cf. Def. ). 

Moreover, by Theorem we know that $\Sigma$ is both non-redundant and 
singleton-rhs. Now, consider by Lemma that AFolding builds a bijection mapping each 
$\langle X, A \rangle \in \Sigma$ to exactly one $\langle Z, A \rangle \in \Sigma^{\curvearrowleft}$ such that $Z \to A$. Since $\Sigma$ is singleton-rhs, 
it is obvious that $\Sigma^{\curvearrowleft}$ is as well and covers all attributes in the rhs of fd’s in $\Sigma$. 
Also, the bijection implies $|\Sigma^{\curvearrowleft}| = |\Sigma|$. Since $\Sigma$ is non-redundant and has exactly 
one fd with each attribute $A$ in its rhs, then by Lemma so is $\Sigma^{\curvearrowleft}$. 

Finally we will show that, unlike $\Sigma$, its folding $\Sigma^{\curvearrowleft}$ must be left-reduced. 
Suppose not. Then for some fd $Z \to A$ in $\Sigma^{\curvearrowleft}$ there is $S \subset Z$ such that non-trivial 
$S \to A$ is in $(\Sigma^{\curvearrowleft})^+$. Since $Z \to A$ is the only fd in $\Sigma^{\curvearrowleft}$ with $A$ in its rhs and $S \to A$ 
is non-trivial, we must have $S \to Z \to A$. 

Now, suppose $S \to A$ is not folded. Then there is $W \nsupseteq S$ such that $W \to S$ 
is in $(\Sigma^{\curvearrowleft})^+$ but $S \not\to W$. Note that $W \neq Z$, as $W \nsupseteq S$. Also, $W \to S$ and 
$S \to Z$ implies $W \to Z$ by (R5) transitivity. Note also that $S \not\to W$ and $S \to Z$ 
implies $Z \not\to W$. But $Z \to A$ is assumed folded. Contradiction. That is, $S \to A$ must be 
folded. Then we have both $S \to A$ and $Z \to A$ folded, though $S \neq Z$. That is, the 
attribute folding of $A$ is not unique, even though we know by Lemma that it must 
be unique. Contradiction. Thus $\Sigma^{\curvearrowleft}$ must be left-reduced and, altogether, therefore, parsimonious. 
□ 

A.2.5 Proof of Theorem 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure, $\varphi$ a total causal mapping over $S$ and $\Sigma$ an 
fd set encoded through $\varphi$ given $S$. Then $x_a, x_b \in \mathcal{V}$ are such that $x_b$ is causally 
dependent on $x_a$, i.e., $(x_a, x_b) \in C_\varphi^+$, iff there is some non-trivial fd $\langle X, B \rangle \in \Sigma^{\triangleright}$ 
with $A \in X$, where $B \mapsto x_b$ and $A \mapsto x_a$.” 


Proof 26 We prove the statement by induction. We consider first the ‘if’ 
direction, and then its ‘only if’ converse. 

(Base case). Let $\langle X, B \rangle \in \Sigma$ be some fd with $A \in X$, where $B \mapsto x_b$ 
and $A \mapsto x_a$. By Theorem it is non-trivial and then by default (i.e., an empty 
application of R5) it is in $\Sigma^{\triangleright}$. But as $X \to B$ is in $\Sigma$, (Alg. ) h-encode ensures 
that there is exactly one equation $f \in \mathcal{E}$ such that $\varphi(f) = x_b$ and $x_a \in \mathit{Vars}(f)$, 
where $B \mapsto x_b$ and $A \mapsto x_a$. Then by force of Eq. 3.1 we must have $(x_a, x_b) \in C_\varphi$. 
Thus, we obviously have $(x_a, x_b) \in C_\varphi^+$ as well. 

(Induction). Now, recall Armstrong’s (R5) pseudo-transitivity rule adapted 
here for the particular case of singleton-rhs fd sets, viz., if $Y \to C$ and $CZ \to B$, 
then $YZ \to B$. By the inductive hypothesis, take any two non-trivial fd’s 
$\langle Y, C \rangle, \langle CZ, B \rangle \in \Sigma^{\triangleright}$ with $B \notin Y$ and assume that the causal dependency property 
holds for their attributes that encode variables. That is, let $D \in Y$ and $E \in Z$, 
where $D \mapsto x_d$ and $E \mapsto x_e$ for $x_d, x_e \in \mathcal{V}$ such that $(x_d, x_c), (x_e, x_b), (x_c, x_b) \in C_\varphi^+$. 
Note that both $Y \to C$ and $CZ \to B$ are non-trivial; then $C \notin Y$, $B \notin Z$ and 
$B \neq C$. Moreover, $B \notin Y$ has been assumed such that the fd $\langle YZ, B \rangle \in \Sigma^{\triangleright}$ to be 
derived by R5 over $Y \to C$ and $CZ \to B$ is also non-trivial to satisfy the condition 
of the theorem. Now, it is easy to see that the property holds likewise for 
non-trivial fd $\langle YZ, B \rangle \in \Sigma^{\triangleright}$. In fact, $(x_d, x_c), (x_c, x_b) \in C_\varphi^+$ implies $(x_d, x_b) \in C_\varphi^+$, and 
also by the inductive hypothesis we have $(x_e, x_b) \in C_\varphi^+$. That is, for either some 
$D \in Y$ or some $E \in Z$, we must have $(x_d, x_b), (x_e, x_b) \in C_\varphi^+$. 

The converse ‘only if’ direction can be shown by a symmetrical inductive 
argument. That is, for the base case suppose $(x_a, x_b) \in C_\varphi$. Then, by Eq. 3.1 we 
know there is some $f \in \mathcal{E}$ such that $\varphi(f) = x_b$ and $x_a \in \mathit{Vars}(f)$. Moreover, in 
that case (Alg. ) h-encode ensures there must be some non-trivial fd $\langle X, B \rangle \in \Sigma$ 
with $A \in X$ where $B \mapsto x_b$ and $A \mapsto x_a$. Thus by an empty application of R5 we 
have $\langle X, B \rangle \in \Sigma^{\triangleright}$. The inductive step shows the property still holds for arbitrary 
causal dependencies in $C_\varphi^+$. □ 

A.2.6 Proof of Proposition 

“Let $S(\mathcal{E}, \mathcal{V})$ be a structure with variable $x \in \mathcal{V}$. Then $x$ can only be a first cause 
of some $y \in \mathcal{V}$ if $x$ is exogenous. Accordingly, any variable $y \in \mathcal{V}$ can only have 
some first cause $x \in \mathcal{V}$ if it is endogenous.” 

Proof 27 The proof is straightforward from definitions. For the first statement, 
suppose by contradiction that $x \in \mathcal{V}$ is not exogenous but is a first cause of some 
$y \in \mathcal{V}$. By Def. , $\varphi$ is bijective; then there is some $f \in \mathcal{E}$ such that $\varphi(f) = x$. 
Moreover, as $x$ is not exogenous then by Def. it must be endogenous. In other 
words, there must be some $x_a \in \mathcal{V}$ such that $x_a, x \in \mathit{Vars}(f)$ and then by Eq. 3.1 
we have $(x_a, x) \in C_\varphi$, hence $(x_a, x) \in C_\varphi^+$. However, as $x$ is a first cause, by 
Def. 14 there can be no $y \in \mathcal{V}$ such that $(y, x) \in C_\varphi^+$. Contradiction. 

Now, a symmetrical argument proves the second statement. Also by 
contradiction take a variable $y \in \mathcal{V}$ that is not endogenous and suppose it has some first 
cause $x \in \mathcal{V}$. As variable $y$ is not endogenous then by Def. it must be exogenous. 
In other words, there must be $f \in \mathcal{E}$ such that $\mathit{Vars}(f) = \{y\}$. Thus for any total 
causal mapping $\varphi$ over $S$, we must have $\varphi(f) = y$ and, for no $x \in \mathcal{V}$, we have 
$(x, y) \in C_\varphi$. Therefore it is not possible to derive $(x, y) \in C_\varphi^+$ for some $x \in \mathcal{V}$. But 
as $y$ has some first cause $x \in \mathcal{V}$ by assumption, we must have $(x, y) \in C_\varphi^+$. Contradiction. □ 
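
Both statements can be observed on a small example. The sketch below classifies variables as exogenous or endogenous and collects, for a target variable, the causes that are themselves caused by nothing. It is an illustration only; the function names and the example structure are assumptions, and "first cause" is read here as a transitive cause with no cause of its own.

```python
def classify(structure):
    """Exogenous: some equation mentions only that variable. The rest are endogenous."""
    variables = set().union(*structure.values())
    exogenous = {next(iter(vs)) for vs in structure.values() if len(vs) == 1}
    return exogenous, variables - exogenous


def first_causes(structure, mapping, target):
    """Variables that transitively cause `target` and have no cause themselves."""
    direct = {(v, mapping[f]) for f, vs in structure.items() for v in vs if v != mapping[f]}

    def causes_of(x):
        return {a for (a, b) in direct if b == x}

    ancestors, frontier = set(), {target}
    while frontier:
        frontier = {a for x in frontier for a in causes_of(x)} - ancestors
        ancestors |= frontier
    return {a for a in ancestors if not causes_of(a)}


# Hypothetical structure: f1 fixes x1, f2 relates x1 and x2, f3 relates all three.
STRUCTURE = {"f1": {"x1"}, "f2": {"x1", "x2"}, "f3": {"x1", "x2", "x3"}}
MAPPING = {"f1": "x1", "f2": "x2", "f3": "x3"}

exogenous, endogenous = classify(STRUCTURE)
print(sorted(exogenous), sorted(endogenous))
print(first_causes(STRUCTURE, MAPPING, "x3"), first_causes(STRUCTURE, MAPPING, "x1"))
```

The only first cause found is the exogenous x1, and the exogenous variable itself has no first cause, in line with the proposition.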


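A.2.7 Proof of Lemma 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure, $\varphi$ a total causal mapping over $S$ and $\Sigma$ an 
fd set encoded through $\varphi$ given $S$. Then a variable $x_a \in \mathcal{V}$ can only be a first cause 
of some variable $x_b \in \mathcal{V}$, where $\langle X, B \rangle \in \Sigma$, and $B \mapsto x_b$, $A \mapsto x_a$, if either (i) 
$A \in X$ or (ii) $A \notin X$ but there is $\langle Z, C \rangle \in \Sigma^{\triangleright}$ with $A \in Z$ and $C \in X$.” 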

Proof 28 We prove the statement by construction out of Theorem . 

By Def. 14, one of the conditions for $x_a$ to be a first cause of $x_b$ is that 
$(x_a, x_b) \in C_\varphi^+$. Moreover, by Theorem we know that $(x_a, x_b) \in C_\varphi^+$ can only 
hold if there is some non-trivial fd $\langle Z, B \rangle \in \Sigma^{\triangleright}$ with $A \in Z$, where $B \mapsto x_b$ and 
$A \mapsto x_a$. Now, by Def. $\varphi$ is bijective; then there is $\langle X, B \rangle \in \Sigma$. Moreover, since 
$\Sigma$ is parsimonious, $X \to B$ is the only fd in $\Sigma$ with $B$ in its rhs. So let $A \notin X$. 
Then we know $X \neq Z$, hence $X \to B$ cannot be the fd required by Theorem . 

Then such fd $Z \to B$ with $A \in Z$ can only exist if derived by some finite 
application of R5. That is, there must be some $\langle Z, C \rangle \in \Sigma^{\triangleright}$ with $A \in Z$ such that 
$X = CW$ for some $W$, and then R5 can be applied over $\langle Z, C \rangle, \langle CW, B \rangle \in \Sigma^{\triangleright}$ to 
get non-trivial $\langle ZW, B \rangle \in \Sigma^{\triangleright}$ where $A \in Z$. 

Now, it is easy to see that when such fd $Z \to C$ with $A \in Z$ does not exist 
in $\Sigma^{\triangleright}$ (the second condition of the lemma), then obviously $Z \to B$ cannot exist in 
$\Sigma^{\triangleright}$ to satisfy the requirement imposed by Theorem . That is, $(x_a, x_b) \notin C_\varphi^+$. □ 

A.2.8 Proof of Theorem 

“Let $S(\mathcal{E}, \mathcal{V})$ be a complete structure, $\varphi$ a total causal mapping over $S$ and $\Sigma$ an 
fd set encoded through $\varphi$ given $S$. Now, let $B$ be an attribute that encodes some 
variable $x_b \in \mathcal{V}$. If $\langle X, B \rangle \in \Upsilon(\Sigma)^{\curvearrowleft}$, then every first cause $x_a$ of $x_b$ (if any) is 
encoded by some attribute $A \in X$.” 


Proof 29 We show that the existence of a missing first cause $x_c$ of $x_b$ for folded 
$X \to B$, where $B \mapsto x_b$ and $C \mapsto x_c$ but $C \notin X$, leads to a contradiction. 

Suppose, by contradiction, that there is some missing first cause $x_c \in \mathcal{V}$ of 
$x_b$, where $C \mapsto x_c$ and $C \notin X$. Then, by Lemma , since variable $x_c$ is a first cause 
of variable $x_b$, it must be exogenous and, for $\langle Y, B \rangle \in \Upsilon(\Sigma)$, either (i) $C \in Y$ or 
(ii) $C \notin Y$ but there is $\langle Z, D \rangle \in \Upsilon(\Sigma)^{\triangleright}$ with $C \in Z$ and some $D \in Y$. 

In the first case (i), since $x_c$ is exogenous and $\Sigma$ is parsimonious, we have 
$\langle \phi, C \rangle \in \Sigma$, but by Def. 15 there can be no $W \to C$ in the u-projection $\Upsilon(\Sigma)$ of $\Sigma$. 
That is, $C$ cannot be ‘consumed’ by R5 and then $\langle Y, B \rangle \in \Upsilon(\Sigma)$ with $C \in Y$ implies 
that, for any $\langle W, B \rangle \in \Upsilon(\Sigma)^{\triangleright}$, we must have $C \in W$. However, by assumption we 
have $\langle X, B \rangle \in \Upsilon(\Sigma)^{\curvearrowleft}$; then, by Def. , $\langle X, B \rangle \in \Upsilon(\Sigma)^{\triangleright}$, yet $C \notin X$. Contradiction. 

In the second case (ii), observe that $C \in Z$ and $D \in Y$, and let $Y = DS$. 
Then by R5 over $Z \to D$ and $DS \to B$ we get $\langle ZS, B \rangle \in \Upsilon(\Sigma)^{\triangleright}$, where $C \in Z$. 
Well, either $ZS \to B$ is folded or it is not, rendering two cases for analysis. If 
$ZS \to B$ is folded, then both $ZS \to B$ and $X \to B$ are folded. But as $\Upsilon(\Sigma)$ is 
parsimonious, then by Lemma the folding of $B$ must be unique. Therefore we 
must have $ZS = X$, with $C \in Z$ but $C \notin X$. Contradiction. 

(Footnote to the theorem statement: note that the folding is taken w.r.t. the u-projection of $\Sigma$; then $x_b$, where $B \mapsto x_b$, is an 
endogenous variable.) 

Else, assume $ZS \to B$ is not folded. Then by Def. 11 there is some $W$ with 
$W \nsupseteq ZS$ such that non-trivial $W \to ZS$ is in $\Upsilon(\Sigma)^+$ and $ZS \not\to W$. 

However, as $C \in Z$ and $W \to ZS$, by (R3) decomposition we must have 
$W \to C$ in $\Upsilon(\Sigma)^+$, either with $C \in W$ or with $C \notin W$ and then $W \to C$ is 
non-trivial. But we know the latter cannot be the case by the same argument used in 
the first case (i), viz., $x_c$ is exogenous with $C \mapsto x_c$ and $\Sigma$ is parsimonious. That 
is, we must have $C \in W$. Furthermore, as $W \to ZS$ is in $\Upsilon(\Sigma)^+$, then by R5 over 
$ZS \to B$ we get $\langle W, B \rangle \in \Upsilon(\Sigma)^{\triangleright}$ with $C \in W$. Now it is easy to see that the same 
situation recurs for $W \to B$. If it is not folded, eventually for some $T$ we will have 
$T \to W \to B$ with $C \in T$, where $T \to B$ will be folded just like $X \to B$. That is, 
by (Lemma ) the uniqueness of the folding of $B$, we will have $T = X$ with $C \in T$ 
and $C \notin X$. Contradiction. □ 


A.3. Proofs of Probabilistic DB Synthesis 

A.3.1 Proof of Theorem 

“Let $S_k$ and $H_k$ be (resp.) the complete structure and ‘big’ fact table of hypothesis 
$k$, and let $\Gamma'_k$ be the repaired factorization of $S_k$ over $H_k$, and $Y_0$ the ‘explanation’ 
table where hypothesis $k$ is recorded. Now, let $\boldsymbol{Y}_k$ be a U-relational schema defined 
$\boldsymbol{Y}_k$ = synthesize4u($S_k$, $H_k$, $Y_0$). Then $\boldsymbol{Y}_k$ is in BCNF w.r.t. $\Gamma'_k$ and is 
minimal-cardinality.” 


Proof 30 Let Yf[ViDi and Y^[VjDj I^T] be (resp.) any u-factor 

projection and predictive projection of Hk. Note that all fd’s in T'^ are either in 
<h(r(,) of form —)■ R or 0 —)■ R, or in T(rj.) of form Ai A 2 ... AiS ^ T with 
u G S'. We must show that no fd in (r(j)+ can violate Yf or Yl. It is easy to see 


that the projection (cf. Def. 20) of non-trivial fd’s in $(r'^)’'' onto Yf is empty, 
just like the projection of non-trivial fd’s in T(r(,)’'' onto Yf. 

For the u-factor projections, note by Def. that for any fd X —)■ C in 
(T^,)’*' to violate BCNF in Yf[ViDi | it must be non-trivial {C ^ X) with 

XC C (pAiGi but X -/A (pAiGi (that is, X is not a superkey for Yf). Note that we 
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have both (f)Ai^B and (j)^ AiB in for any B G Gi, bnt both cf) Ai and 0 

are snperkeys for Y^. Also, that there can be no non-trivial fd’s {X,C) G <h(r'^)’'' 
with (j) ^ X, and by the dehnition of Problem we know that is a maximal 

gronp. So, for any non-trivial {X,C) G X mnst be a snperkey for Yj^. 

Tims no n-factor projection can be snbject of BCNF violation w.r.t. Pj.. 

Now, for predictive projection $Y_k^j[V_j D_j \mid S\,T]$ let us reconstruct the process
towards deriving $(S, T) \in \Upsilon(\Gamma'_k)^+$. Note that it is derived by synthesize4u simulating
$\ell$ applications of R5 over $(\phi, A_i)$, $(A_1 A_2 \ldots A_\ell S, T) \in \Gamma'_k$ for $0 < i \le \ell$. Note
also that (i) no cyclic fd’s can be involved in such R5 applications, as they must
always be over an fd in $\Phi(\Gamma'_k)$; and (ii) as a result of (Alg. merge, $A_1 A_2 \ldots A_\ell S \to T$
was the only non-trivial fd in the projection $\pi_{A_1 A_2 \ldots A_\ell S T}(\Gamma'_k)$. Thus, the only
non-trivial fd’s in the projection $\pi_{ST}((\Gamma'_k)^+)$ in addition to $S \to T$ itself must be of
form $S \to C$ rendered out of (R3) decomposition from it for all $C \in T$. In any
such fd’s, we have $S$ as a superkey for $Y_k^j$. Therefore no predictive projection can
be subject to BCNF violation w.r.t. $\Gamma'_k$.

For the minimality, note that, as a consequence of (Alg. merge, any two
schemes $Y_k^i[XZ]$, $Y_k^j[VW]$ are rendered by synthesize4u into $Y_k$ iff we have fd’s
$(X, Z), (V, W) \in (\Gamma'_k)^+$ and $X \not\leftrightarrow V$, i.e., it is not the case that both $X \to V$ and
$V \to X$ hold in $(\Gamma'_k)^+$. Now, to prove that $Y_k$ is minimal-cardinality, we have to
find that merging any such pair of arbitrary schemes shall hinder BCNF in the
resulting schema. In fact, take $Y_k \setminus (Y_k^i[XZ] \cup Y_k^j[VW])$ together with a single
merged scheme over $XZVW$. As $X \not\leftrightarrow V$, then neither $X$ nor $V$ can be a superkey
for the merged scheme, which therefore cannot be in BCNF. □
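
The BCNF argument above can be checked mechanically on small cases. The following is a minimal sketch, assuming fd’s are given as pairs of attribute sets; the function names (`closure`, `bcnf_violations`) and the attribute names in the example are hypothetical and are not part of synthesize4u. It enumerates candidate left-hand sides by brute force, so it is only meant for toy schemes.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (a list of (lhs, rhs) frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def bcnf_violations(scheme, fds):
    """Return pairs (X, C) with XC inside `scheme`, C not in X, X -> C implied
    by `fds`, and X not a superkey for `scheme`."""
    scheme = frozenset(scheme)
    violations = []
    for size in range(len(scheme)):  # all proper subsets, including the empty set
        for combo in combinations(sorted(scheme), size):
            x = frozenset(combo)
            x_plus = closure(x, fds)
            if scheme <= x_plus:
                continue  # X is a superkey, so it cannot violate BCNF
            for c in sorted((x_plus & scheme) - x):
                violations.append((x, c))
    return violations


# Hypothetical fd's (X, Z) and (V, W), with neither X -> V nor V -> X.
fds = [
    (frozenset({"X"}), frozenset({"Z"})),
    (frozenset({"V"}), frozenset({"W"})),
]
separate_ok = bcnf_violations({"X", "Z"}, fds) == [] and bcnf_violations({"V", "W"}, fds) == []
merged_violations = bcnf_violations({"X", "Z", "V", "W"}, fds)
print(separate_ok, len(merged_violations) > 0)
```

Kept apart, the two schemes show no violation; merged into one scheme over XZVW, the fd X → Z has a left-hand side that is not a superkey, which is the situation used in the minimality argument.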

A.3.2 Proof of Theorem 

“Let $S_k$ be the complete structure of hypothesis $k$, and $H_k[U]$ its ‘big’ fact table
such that $\Gamma'_k$ is the repaired factorization of $S_k$ over $H_k$ and $Y_0$ is the ‘explanation’
table where hypothesis $k$ is recorded. Now, let $Y_k$ be a U-relational schema defined
$Y_k$ = synthesize4u($S_k$, $H_k$, $Y_0$). Then,

(a) the join $\bowtie Y_k^i[V_i D_i \mid \phi A_i G_i]$ of any subset of the u-factor projections
of $H_k$ is lossless w.r.t. $\Gamma'_k$.

(b) any predictive projection $Y_k^j[V_j D_j \mid S\,T]$, result of a join of the theoretical
u-factor $Y_0[V_0 D_0 \mid \phi\,\upsilon]$ with the ‘big’ fact table $H_k[U]$ and in turn with
u-factor projections $Y_k^i[V_i D_i \mid \phi A_i G_i]$, is lossless w.r.t. $\Gamma'_k$.”


Proof 31 For item (a), by Lemma we know that any pair $Y_k^i[V_i D_i \mid \phi A_i G_i]$,
$Y_k^j[V_j D_j \mid \phi A_j G_j]$ of u-factor projections of $H_k$ will have a lossless join w.r.t. $\Gamma'_k$
iff $(\phi A_i G_i \cap \phi A_j G_j) \to (\phi A_i G_i \setminus \phi A_j G_j)$ or $(\phi A_i G_i \cap \phi A_j G_j) \to (\phi A_j G_j \setminus \phi A_i G_i)$
hold in $(\Gamma'_k)^+$. By Def. 17, we know that $(\phi A_i G_i \cap \phi A_j G_j) = \{\phi\}$, and
$\phi A_i G_i \setminus \phi A_j G_j = A_i G_i$. In fact $\phi \to A_i G_i$ is a repaired fd in $\Gamma'_k$, therefore $Y_k^i$
and $Y_k^j$ have a lossless join. Now, since the join is an associative operation [20, p.
62], and as we have chosen $Y_k^i$ and $Y_k^j$ arbitrarily, then clearly any subset of the
u-factor projections must have a lossless join.


For item (b), for any predictive projection $Y_k^j[V_j D_j \mid S\,T]$ take the join
$\bowtie Y_k^i[V_i D_i \mid \phi A_i G_i]$ of u-factor projections such that, for all $A_i \in A_i G_i$, we have
$A_i \in W \subseteq Z$ where $S = Z \setminus W$ and $(Z, T) \in \Gamma'_k$. That is, $A_i$ is a pivot attribute
representing a first cause of some $C \in T$. By item (a), we know that such join is
lossless.

We must show that the join $\bowtie Y_k^i$ with ‘big’ fact table $H_k[U]$ is also lossless.
By Lemma that is the case iff $(\phi A_i G_i \cap U) \to (\phi A_i G_i \setminus U)$ or
$(\phi A_i G_i \cap U) \to (U \setminus \phi A_i G_i)$ hold in $(\Gamma'_k)^+$. In fact, we have $(\phi A_i G_i \cap U) = \phi A_i G_i$ and
$(\phi A_i G_i \setminus U) = \emptyset$ such that $\phi A_i G_i \to \emptyset$ is trivially in $(\Gamma'_k)^+$.

Finally, the join of theoretical u-factor $Y_0[V_0 D_0 \mid \phi\,\upsilon]$ with big fact table
$H_k[U]$ must be lossless likewise. In fact, note that $(\phi\,\upsilon \cap U) = \phi\,\upsilon$, and $(\phi\,\upsilon \setminus U) =
\emptyset$. Then also trivially we have $\phi\,\upsilon \to \emptyset$, which is in $(\Gamma'_k)^+$ as well. Since the join
is commutative [20, p. 62], the order of application is irrelevant; therefore the join
of all joins examined above taken together must be lossless. □
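
At the level of relation instances, a lossless join means that projecting a table onto the two schemes and joining the projections back yields exactly the original rows. A minimal sketch of that check follows, assuming rows are plain dictionaries; the helper names and the attribute names (`phi`, `A1`, `B1`, `A2`, `B2`) are hypothetical and only mimic two u-factor schemes sharing the attribute φ.

```python
def project(rows, attrs):
    """Project a list of dict rows onto `attrs`, dropping duplicate tuples."""
    return {tuple((a, row[a]) for a in attrs) for row in rows}


def natural_join(left, right):
    """Natural join of two sets of tuples made of (attribute, value) pairs."""
    joined = set()
    for l_tuple in left:
        l_row = dict(l_tuple)
        for r_tuple in right:
            r_row = dict(r_tuple)
            shared = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in shared):
                merged = {**l_row, **r_row}
                joined.add(tuple(sorted(merged.items())))
    return joined


def as_set(rows):
    """Normalize dict rows to the same representation used by natural_join."""
    return {tuple(sorted(row.items())) for row in rows}


def join_is_lossless(rows, attrs_left, attrs_right):
    """True iff joining the two projections of `rows` gives back exactly `rows`."""
    return natural_join(project(rows, attrs_left), project(rows, attrs_right)) == as_set(rows)


left_attrs = ["phi", "A1", "B1"]
right_attrs = ["phi", "A2", "B2"]

# phi determines every other attribute here, so the decomposition is lossless.
rows_fd_holds = [
    {"phi": 1, "A1": 0.1, "B1": 0.2, "A2": 5, "B2": 6},
    {"phi": 2, "A1": 0.3, "B1": 0.4, "A2": 7, "B2": 8},
]
# phi no longer determines A1 B1 or A2 B2, so the join creates spurious rows.
rows_fd_broken = [
    {"phi": 1, "A1": 0.1, "B1": 0.2, "A2": 5, "B2": 6},
    {"phi": 1, "A1": 0.3, "B1": 0.4, "A2": 7, "B2": 8},
]
print(join_is_lossless(rows_fd_holds, left_attrs, right_attrs))   # True
print(join_is_lossless(rows_fd_broken, left_attrs, right_attrs))  # False
```

The second table shows why the fd from the shared attributes to one side is needed: without it the join returns four rows instead of the original two.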


Lemma 6 Let $\Sigma$ be a set of fd’s on attributes $U$, and $R_i[S], R_j[T] \in R[U]$ be
relation schemes with $ST \subseteq U$; and let $\pi_{ST}(\Sigma)$ be the projection of $\Sigma$ onto $ST$.
Then $R_i[S]$ and $R_j[T]$ have a lossless join w.r.t. $\pi_{ST}(\Sigma)$ iff $(S \cap T) \to (S \setminus T)$ or
$(S \cap T) \to (T \setminus S)$ hold in $\pi_{ST}(\Sigma)^+$.


Proof 32 See Ullman [20, p. 397]. □
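
The condition of Lemma 6 can be tested with the standard attribute-closure algorithm. The sketch below is an illustration under stated assumptions: fd’s are pairs of attribute sets, the function names are hypothetical, and the closure is computed over the full fd set rather than over an explicit projection. That shortcut is sound for this test because an fd whose attributes all lie in ST is implied by the full set exactly when it belongs to the closure of the projection onto ST.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (a list of (lhs, rhs) frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def lossless_binary_join(s, t, fds):
    """Test of Lemma 6: (S & T) -> (S - T) or (S & T) -> (T - S) is implied by `fds`."""
    s, t = frozenset(s), frozenset(t)
    common_plus = closure(s & t, fds)
    return (s - t) <= common_plus or (t - s) <= common_plus


# Hypothetical u-factor schemes sharing only phi, with the fd phi -> A1 B1.
fds = [(frozenset({"phi"}), frozenset({"A1", "B1"}))]
s = {"phi", "A1", "B1"}
t = {"phi", "A2", "B2"}
print(lossless_binary_join(s, t, fds))  # True
print(lossless_binary_join(s, t, []))   # False: no fd from the shared attribute

# A scheme contained in the other one: S - T is empty, so the test holds trivially.
u = {"phi", "A1", "B1", "A2", "B2", "x"}
print(lossless_binary_join(s, u, []))   # True
```

The last call mirrors the step in Proof 31 where a u-factor scheme is joined with the ‘big’ fact table over U: the difference is empty, so the required fd is trivial.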

















